Tracking citations to research software via PIDs
Lars Holm Nielsen & Stephanie van de Sandt
CERN, IT Department, Digital Repositories Section
Upload Describe Publish
Zenodo
Zenodo + Software 2014
2019
75% of the world’s software DOIs
The Asclepias project
• Brokering and harvesting scholarly links
• Open citation data in ADS, Crossref/DataCite Event Data and Europe PMC.
• ~6000 citations to Zenodo records (January 2019)
Roles of researcher
• Author of scholarly manuscripts.
• Developer of scientific software.
[Diagram labels: Credit, Software]
The system
[Diagram. Actors: Author, Publisher, Discovery Services, Developer, Repository. Steps: Write paper, Publish paper, Write software, Publish software, Find software. Links: Citations, “Cite my software as …”]
Systemic key issues
1. Information loss
2. Dilution of citations
Information loss: Author
• What do I cite? Paper, software, software version
• Citation recommendations
• Reference manager (e.g. BibTeX, EndNote, …)
• Exists? Correct? BibTeX latency
• “Software” type doesn’t exist.
• No version field support in BibTeX.
• Persistent identifier for software
• zero, one or more?
Include citation in paper
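
A minimal sketch of one workaround for the missing “Software” type and version field: emit a BibTeX `@misc` entry that carries the version and the DOI explicitly. The function name `software_bibtex`, the field layout, the authors, the software name and the DOI are all placeholders for illustration, not the output of Zenodo or of any reference manager. Many classic BibTeX styles ignore unknown fields such as `version`, so the sketch repeats the version in `note`.

```python
def software_bibtex(key, authors, name, version, year, doi, publisher):
    """Return a BibTeX @misc entry for one specific software version."""
    fields = {
        "author": " and ".join(authors),
        # Extra braces protect the capitalisation of the software name.
        "title": f"{{{name}}}",
        "version": version,            # ignored by many classic styles
        "note": f"Version {version}",  # fallback that classic styles do print
        "year": str(year),
        "publisher": publisher,
        "doi": doi,
    }
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


entry = software_bibtex(
    key="example_software_1_0",             # hypothetical citation key
    authors=["Doe, Jane", "Roe, Richard"],  # placeholder authors
    name="example-software",                # placeholder software name
    version="1.0",
    year=2019,
    doi="10.5281/zenodo.0000000",           # placeholder, not a real DOI
    publisher="Zenodo",
)
print(entry)
```

Citing the version-specific DOI together with the version string keeps the reference unambiguous even if a later step in the chain drops one of the two.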
The “Challenge”: a messy world
• Triangle.py
• 10.5281/zenodo.10598, 10.5281/zenodo.11020
• Corner.py
• 10.5281/zenodo.45906
• 10.5281/zenodo.53155
• 10.5281/zenodo.591491 (Concept)
• JOSS
• 10.21105/joss.00024
• ASCL (Astronomy Source Code Library)
• https://ascl.net/1702.002
Information loss: Publisher
• Policy prohibits software citation.
• Journal authoring system defects:
• Information from BibTeX is lost
• Crossref DOI ➞ JATS XML ➞ PDF
• Copy editors need training
• Journal -> Scientific society -> Publisher -> Vendor platform -> Outsourcing
Include citation in products
Information loss: Metadata quality
• Example (cite arXiv identifiers)
• yymm.nnnnv1 (published 2012)
• yymm.nnnnv7 (published 2017)
• Paper from 2015 cites “yymm.nnnn”
• Result: 2015 paper cites 2017 software because metadata doesn’t say 2012.
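
The example above can be sketched in a few lines. The version years (v1 in 2012, v7 in 2017) and the citing year (2015) come from the slide; the two function names and the idea of resolving an unversioned identifier against the citing paper's year are assumptions for illustration, not a description of how any discovery service works.

```python
# Publication year per version, as given in the example.
versions = {"v1": 2012, "v7": 2017}


def naive_resolve(versions):
    """Unversioned identifier resolves to the latest version (the failure case)."""
    return max(versions, key=versions.get)


def resolve_as_of(versions, citing_year):
    """Pick the newest version that already existed when the citing paper appeared."""
    candidates = {v: y for v, y in versions.items() if y <= citing_year}
    if not candidates:
        return None  # nothing was published yet; leave the link unresolved
    return max(candidates, key=candidates.get)


latest = naive_resolve(versions)
as_cited = resolve_as_of(versions, citing_year=2015)
print(latest, as_cited)
```

The second function only works if the metadata keeps a publication date for every version, which is exactly the information the example says is missing.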
Information loss: Discovery Service
• Paper ingest workflow:
• 1) identify link 2) create/update local record? 3) attribute citation link.
• Policy prohibits software records.
• Ingestion workflow incapable of identifying software
• Non-trivial to identify local record.
Ingest paper and track citations
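
A rough sketch of the three ingest steps, assuming references arrive as plain strings. The DOI pattern is a simplification, the record layout and the paper identifier `paper-A` are hypothetical, and the two DOIs are the ones listed earlier for Corner.py and JOSS. Real ingestion workflows are considerably more involved.

```python
import re

# Simplified DOI pattern (an assumption; real-world DOIs can be messier).
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def ingest(paper_id, reference_strings, records, citations):
    """Identify DOI links, create missing local records, attribute citations."""
    for ref in reference_strings:
        for match in DOI_RE.finditer(ref):
            # 1) identify link: strip trailing punctuation, normalise case
            doi = match.group(0).rstrip(".,;)").lower()
            # 2) create a local record if none exists yet
            records.setdefault(doi, {"doi": doi, "type": "unknown"})
            # 3) attribute the citation link to the citing paper
            citations.setdefault(doi, set()).add(paper_id)


records, citations = {}, {}
ingest(
    "paper-A",  # hypothetical citing paper
    [
        "corner.py, doi:10.5281/zenodo.45906.",
        "JOSS paper, https://doi.org/10.21105/joss.00024",
    ],
    records,
    citations,
)
print(sorted(records))
```

Step 2 is where the slide's problems bite: a service whose policy prohibits software records, or whose workflow cannot tell that a DOI points to software, never reaches step 3.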
Discovery service differences
• Europe PMC: 71 different publishers
• Springer, F1000, PLOS, PeerJ, Pensoft, Frontiers
• Crossref: 57 different publishers
• Springer, F1000, Pensoft, PeerJ, Wiley
• NASA ADS: 38 different publishers
• arXiv, American Astronomical Society, Springer, IOP, Oxford University Press, Elsevier
Dilution of citations: Developer/Repository
• Persistent identifier: Software, software paper, discovery system identifier (i.e. zero, one or more PIDs)
• Dynamic authorship
• Software name changes
• Granularity: DOI per version, module, module version…
Ensure software is citable
Dilution of citations: Citation recommendations
• Loss of specificity:
• “lmfit 0.9.5 or later [4] was used”
Dilution of citations: BibTeX latency
• trackpy
• Cite what you used
Systemic issues
[Diagram: the same actors and workflow as in “The system”]
How can we expect researchers to change culture if we can’t even track citations to software?
Generality
• Search/Replace:
• “Software” with “Data”, … (except “Paper”)
• “Astronomy” with “Physics”, …
• Problems:
• Information loss, dilution of citations, closed proprietary systems.
The “fix” of a chain-linked system
Systemic issues need joint effort to be solved.
The “Fix”: Publisher
• Software citation policy
• Authoring system:
• Working with vendor to produce correct DOI metadata and JATS XML (machine readability).
The “Fix”: Discovery
• Ingestion workflow for software with DataCite DOI, handling:
• Synonymous PIDs
• Version relationships
• BibTeX generation fixes
The “Fix”: Repository
• DOI Versioning:
• Version relationships
• Version number field
• DataCite metadata
• Dynamic authorship
• BibTeX generation fixes
• GitHub integration
[Figure: software (SW) with versions v1.0 and v1.2]
The “Challenge”: Roll-up citations for software
• Goal: Proper credit for software
• Roll-up citations for software
• Synonymous PIDs (identifies a resource)
• Version relationships (identifies group resources)
• Citation relationships (links between groups of resources)
• Expert curation (actions in individual systems)
• Information needed by all:
• Discovery systems
• Repositories
• Problem: Share and exchange information about scholarly links.
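
Roll-up can be sketched as grouping identifiers that are linked by synonym or version relationships and summing citations per group. The sketch below uses a small union-find structure. The identifiers are the ones from the “messy world” slide, but which identifier relates to which, the relation labels (loosely modelled on DataCite relation types) and all citation counts are assumptions made up for illustration.

```python
def find(parent, x):
    """Return the representative PID of the group that x belongs to."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps lookups short
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the groups of a and b; the representative of a's group wins."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def roll_up(links, citation_counts):
    """Sum citation counts over all PIDs connected by any relationship."""
    parent = {}
    for a, _relation, b in links:
        union(parent, a, b)
    totals = {}
    for pid, count in citation_counts.items():
        root = find(parent, pid)
        totals[root] = totals.get(root, 0) + count
    return totals


CONCEPT = "10.5281/zenodo.591491"
links = [  # assumed relationships, for illustration only
    (CONCEPT, "HasVersion", "10.5281/zenodo.10598"),
    (CONCEPT, "HasVersion", "10.5281/zenodo.11020"),
    (CONCEPT, "HasVersion", "10.5281/zenodo.45906"),
    (CONCEPT, "HasVersion", "10.5281/zenodo.53155"),
    (CONCEPT, "IsDescribedBy", "10.21105/joss.00024"),
    (CONCEPT, "IsIdenticalTo", "https://ascl.net/1702.002"),
]
counts = {  # made-up citation counts
    "10.5281/zenodo.10598": 1,
    "10.5281/zenodo.45906": 3,
    "10.5281/zenodo.53155": 2,
    "10.21105/joss.00024": 5,
}
totals = roll_up(links, counts)
print(totals)
```

The grouping is only as good as the links fed into it, which is why discovery systems and repositories need to exchange the same relationship information.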
Software citation today
• Primarily self-citation (~80% of citations)
• Not necessarily bad (SW citation principles)
• Citation count >5 (~2%)
• Generic libraries (neural networks, stats visualisation, …)
• Citation recommendations in bad shape
• Each recommendation has a unique story
Software citation today
Citation growth rate is higher than Zenodo uploads growth rate
• Software citation is in pretty bad shape … but don’t despair (it is still in its infancy)!
• Systemic issues can only be solved with joint efforts
• Problems exposed also impact PID knowledge graphs in general.
Thanks for listening…
