The Initiative for Open Citations and the OpenCitations Corpus

David Shotton
Oxford e-Research Centre, University of Oxford, UK
david.shotton@opencitations.net
9th Conference on Open Access Scholarly Publishing
Lisbon, Portugal, 20 Sept 2017
© David Shotton 2017. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence

2013 “Free scholarly citation data!”
Donatello’s
John the Baptist
Fifth Conference on
Open Access
Riga, Latvia
20 September 2013
. . . the voice of one
crying in the wilderness

2016 “Release open citation data!”
Eighth Conference on
Open Access
Virginia, USA
20 September 2016
Dario Taraborelli
Head of Research,
Wikimedia Foundation

2017 The year of success - citation data is freed!
n  Two fantastic success stories
§  The Initiative for Open Citations https://i4oc.org/
§  The OpenCitations Corpus http://opencitations.net
n  While related, these initiatives are separate and distinct
n  Two Italian heroes: Dario Taraborelli and Silvio Peroni

Crossref - providing the fundamental infrastructure
https://www.crossref.org/
n  Crossref is the registration agency of Digital Object Identifiers (DOIs) for
scholarly publications (journal articles). Most publishers are members
n  Crossref holds metadata about articles, made available via its REST API
https://www.crossref.org/services/metadata-delivery/rest-api/
n  Crossref has its own heroes:
§  Ed Pentz, Executive Director
§  Geoff Bilder, Director of Strategic Initiatives
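A minimal sketch of how the REST API mentioned above might be used to look up one article and read its reference list. It assumes the public endpoint https://api.crossref.org/works/{doi} and that open references appear under message.reference in the response; the function names and the sample response are illustrative only, and no network call is made.

```python
from urllib.parse import quote

API_ROOT = "https://api.crossref.org/works/"  # assumed public REST endpoint


def works_url(doi: str) -> str:
    """Return the REST API URL for a single DOI (the DOI is percent-encoded)."""
    return API_ROOT + quote(doi, safe="")


def open_references(response: dict) -> list:
    """Return the DOIs of the references exposed in a works response, if any."""
    refs = response.get("message", {}).get("reference", [])
    return [ref["DOI"] for ref in refs if "DOI" in ref]


# Hand-made sample shaped like a works response (not real API output).
sample = {
    "message": {
        "DOI": "10.1007/s11892-016-0752-4",
        "reference": [
            {"key": "ref1", "DOI": "10.1093/intimm/dxp039"},
            {"key": "ref2", "unstructured": "A reference deposited without a DOI"},
        ],
    }
}

print(works_url("10.1007/s11892-016-0752-4"))
print(open_references(sample))
```

When a publisher keeps its references closed, the reference list would simply be absent from the response, which is why the helper falls back to an empty list.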

The Initiative for Open Citations
n  The Initiative for Open Citations is a collaboration between scholarly publishers,
researchers, and other interested parties to promote the unrestricted availability
of scholarly citation data. It does not host citation data!
n  Launched April 6, 2017 Web site https://i4oc.org
n  Spearheaded by Dario Taraborelli of the Wikimedia Foundation
§  with help from Jonathan Dugan, Martin Fenner, Jan Gerlach,
Catriona MacCallum, Daniel Mietchen, Cameron Neylon,
Mark Patterson, Michelle Paulson, Silvio Peroni and myself
n  Six founding organizations:
§  The Wikimedia Foundation, PLOS, eLife, DataCite, OpenCitations,
and the Centre for Culture and Technology at Curtin University
n  Within a short space of time, I4OC has persuaded most of the major scholarly
publishers to make their reference lists open, so that the proportion of all
references submitted to Crossref that are now open has risen from 1% to
over 45%!

Publishers supporting I4OC and opening their references
n  49 scholarly publishers have opened their references, including the following
major ones:
n  Commercial publishers
§  Association for Computing Machinery, BMJ, De Gruyter, eLife, EMBO
Press, Hindawi, IOS Press, PeerJ, Pensoft Publishers, Portland Press,
Public Library of Science, Springer Nature, Taylor & Francis, Wiley
n  University and scholarly presses
§  Cambridge University Press, Cold Spring Harbor Laboratory Press,
Company of Biologists, Edinburgh University Press, MIT Press,
Rockefeller University Press
n  Learned societies
§  American Association for the Advancement of Science (AAAS),
American Physical Society, American Society for Cell Biology,
International Union of Crystallography, Proceedings of the
National Academy of Sciences (PNAS), Royal Society of Chemistry,
The Royal Society

Organizations and institutions who have endorsed I4OC
n  Funders
§  Sloan Foundation, Bill and Melinda Gates Foundation, Jisc, Simons
Foundation's Science Sandbox, Wellcome Trust
n  Research organizations
§  Allen Institute for Artificial Intelligence, Microsoft Research
n  Libraries
§  Association of Research Libraries, British Library, California Digital
Library, Harvard Library Office for Scholarly Communication, LIBER,
Max Planck Digital Library
n  Bibliographic / bibliometric organizations
§  Altmetrics, CiteSeerX, DBLP Computer Science Bibliography,
ImpactStory, Zotero
n  Other organizations
§  Dryad Data Repository, Figshare, Internet Archive, Mozilla, OASPA,
Open Knowledge International, OpenAire, ScienceOPEN, Wiki Education
Foundation, Wikimedia Deutschland, Wikimedia UK

I4OC – what’s left to do
n  Almost 50% of Crossref-deposited references, from ~16 million articles, are
now open, leaving about half that are still closed
n  Crossref has over 7000 members, and it’s the long tail of smaller
publisher-members that are not presently opening their references
n  This includes a large number of Open Access publishers!
§  Just because an article is published as Open Access and its references
are available on the publisher’s web site, this is not sufficient for the bulk
harvesting and analysis of citation data
§  Imagine the effort of going to each site in turn and scraping reference lists
presented in a wide variety of differing formats and DTD markups!
n  Many small scholarly publishers are not even members of Crossref
n  But help is at hand:
§  OASPA has a sponsored agreement with Crossref whereby its smaller
members can join Crossref via OASPA, with OASPA covering the cost of
a proportion of their DOIs

How to open references using the Crossref Cited-by service
n  The Crossref Cited-by service is a free service that helps publishers find out who
is citing their articles
n  Publishers submit article reference lists to Crossref along with other metadata
n  However, the Crossref default is that these reference lists are closed, not OPEN!
n  To open their article reference lists, a publisher needs to do one of two things:
§  Either contact support@crossref.org and ask them to turn on reference
distribution for all the DOI prefixes they manage
§  Or, in the article metadata they submit to Crossref, set
reference_distribution_opts to “any” for each DOI deposit
where they want to make references openly available
n  It’s that easy!!!
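The second option can be sketched as a small script that adds the setting to every article in a deposit file before it is submitted. This is an illustration under assumptions: the option is spelled here as reference_distribution_opts and is written as an attribute on each journal_article element, both of which should be checked against the current Crossref deposit schema. The function name and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def open_reference_distribution(deposit_xml: str) -> str:
    """Mark every journal_article in a deposit as having open references."""
    root = ET.fromstring(deposit_xml)
    for elem in root.iter():
        # Compare the local name only, so namespaced deposits also match.
        if elem.tag.rsplit("}", 1)[-1] == "journal_article":
            # Assumed attribute name and placement; verify against the schema.
            elem.set("reference_distribution_opts", "any")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed deposit fragment for illustration.
sample = (
    '<journal><journal_article publication_type="full_text">'
    "<titles><title>Example</title></titles>"
    "</journal_article></journal>"
)

print(open_reference_distribution(sample))
```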

ZooKeys use of Crossref open citation data

The OpenCitations Corpus
n  OpenCitations (http://opencitations.net) is a small infrastructure organization
directed by myself and Silvio Peroni
n  Its primary purpose is to host and develop the OpenCitations Corpus (OCC),
a Linked Open Data repository of scholarly bibliographic citation data
n  A founding member of I4OC, it is distinct and separate from that initiative
n  The first OCC prototype was created at Oxford in 2011 with Jisc funding – see
my 2013 COASP talk in Riga (http://zeeba.tv/the-open-citations-corpus/)
n  A new instance of the OCC, based on our revised metadata schema, was
created by Silvio Peroni and is now running at the University of Bologna
n  It has been ingesting scholarly references continuously since early July 2016
n  OCC now provides the largest RDF collection of open citation data on the Web
§  Currently holds references from ~240,000 citing bibliographic resources
§  Provides >10 million citation links to over 5.5 million cited resources
§  These data are freely available under a CC0 public domain waiver

Source data - reference lists from PubMed Central
n  At present, the ingested reference lists are obtained by processing the XML
sources of papers in the Open Access subset of PubMed Central
n  These are parsed to yield authors, titles, journal names, etc.
§  We ask for the most recent papers first
§  Thus, as citing papers, the OCC mainly includes articles published in
2016 and 2017
n  The identifiers of all the citing papers already processed are stored locally, so
as not to request the same XML source twice
n  We then call several external APIs, including Crossref and ORCID, to obtain
additional metadata describing the citing and cited papers and their authors
n  There are almost 1.7 million OA articles available in PubMed Central
§  So far we have harvested 14% . . .
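The "do not request the same XML source twice" step can be sketched as follows. This is a simplified illustration, not the OpenCitations harvesting code: an in-memory set stands in for the local store of processed identifiers, the function name is hypothetical, and the URL pattern is the one shown in the "source" field of the raw reference list data.

```python
EUROPE_PMC = "http://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"


def sources_to_fetch(candidates, processed):
    """Yield (pmcid, url) for papers not seen before, recording each one."""
    for pmcid in candidates:
        if pmcid in processed:
            continue  # already handled, so never request it again
        processed.add(pmcid)  # simplification: recorded before the fetch succeeds
        yield pmcid, EUROPE_PMC.format(pmcid=pmcid)


processed = {"PMC4863913"}  # stand-in for the locally stored identifiers
todo = list(sources_to_fetch(["PMC4863913", "PMC2686615", "PMC2686615"], processed))
print(todo)

# 14% of ~1.7 million OA articles, in line with the ~240,000 citing resources held.
harvested = 1_700_000 * 14 // 100
print(harvested)
```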

The raw reference list data
n  The reference lists extracted from citing papers are made available in JSON:
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "localid": "MED-27168063",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
  "source_provider": "Europe PubMed Central",
  "references": [
    ...
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    },
    ...
  ]
}
The citing paper's metadata and identifiers
A reference in the citing paper's reference list, with its own ids
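A record of this shape can be turned into citation links with a few lines of code. The sketch below is illustrative only: it uses a trimmed copy of the record above (identifier fields only, plus one hypothetical reference without a DOI), and the function name is an assumption rather than part of the OpenCitations software.

```python
import json

# Trimmed copy of the record above; the second reference is hypothetical.
record_json = """
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "references": [
    {"pmid": "19461125", "doi": "10.1093/intimm/dxp039", "pmcid": "PMC2686615"},
    {"bibentry": "A reference for which no DOI was found"}
  ]
}
"""


def citation_links(record: dict) -> list:
    """Return (citing DOI, cited DOI) pairs for references that carry a DOI."""
    citing = record["doi"]
    return [
        (citing, ref["doi"])
        for ref in record.get("references", [])
        if "doi" in ref
    ]


links = citation_links(json.loads(record_json))
print(links)
```

References without a DOI are skipped here for brevity; a fuller pipeline would try the other identifiers or the bibentry text before giving up.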

The SPAR (Semantic Publishing and Referencing) Ontologies
n  OCC data are then stored in RDF (JSON-LD) using the SPAR (Semantic
Publishing and Referencing) ontologies and other standard vocabularies
n  These SPAR ontologies include:
§  FaBiO, the FRBR-aligned Bibliographic Ontology - an ontology for
describing bibliographic entities (books, articles, etc.)
§  CiTO, the Citation Typing Ontology - enables the characterization of
citations, both factually and rhetorically
§  BiRO, the Bibliographic Reference Ontology - an ontology to define
bibliographic records and references, and their compilation into
bibliographic collections and reference lists, respectively
http://www.sparontologies.net/
Availability of the OpenCitations Corpus data
n  All the OpenCitations software is available on GitHub under an open license
n  The data in the OpenCitations Corpus are available in three different ways:
§  Direct access to bibliographic resources by means of their HTTP URIs
(via content negotiation), e.g. https://w3id.org/oc/corpus/br/1
§  Queries to our SPARQL endpoint: https://w3id.org/oc/sparql
§  Monthly dumps stored in Figshare: http://opencitations.net/download
n  Currently the OCC uses a good graph-based triplestore – Blazegraph
n  However, the virtual machine that hosts it is very limited in resources,
causing performance problems for demanding SPARQL queries
n  We plan soon to commission a new powerful physical server that should
provide a better user experience, and to develop additional user-friendly
interfaces for accessing the OCC data, including graphic visualizations of
citation networks
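The first two access routes can be sketched with the Python standard library. This is a hedged illustration that only builds the requests and sends nothing: the media type requested through content negotiation and the example query are assumptions, and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def resource_request(uri: str, media_type: str = "application/ld+json") -> Request:
    """Build a GET request asking for a resource in a given RDF format."""
    # The media type is an assumption; the server decides which formats it offers.
    return Request(uri, headers={"Accept": media_type})


def sparql_url(endpoint: str, query: str) -> str:
    """Build a GET URL that passes a query in the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Example query: the first ten resources cited by bibliographic resource 1.
QUERY = """PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?cited WHERE { <https://w3id.org/oc/corpus/br/1> cito:cites ?cited } LIMIT 10"""

req = resource_request("https://w3id.org/oc/corpus/br/1")
url = sparql_url("https://w3id.org/oc/sparql", QUERY)
print(req.get_header("Accept"))
print(url)
```

Passing req or url to urllib.request.urlopen would perform the actual call; small LIMIT values are kinder to a server with limited resources.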

Use of the OpenCitations web site
n  Accesses to the OpenCitations web site and services:
The “corpus” and “sparql” pages have together gained 89% of the total accesses, showing that
people mainly access the OpenCitations Corpus to explore and use the data within it

Use of OpenCitations data stored on Figshare

What happened this summer?
n  Use of the OpenCitations social accounts increased markedly following the
launch of the Initiative for Open Citations:
§  Twitter - https://twitter.com/opencitations
§  Wordpress Blog – https://opencitations.wordpress.com/

Who is using OpenCitations, and for what?
n  Organizations and projects that we know use OpenCitations resources include:
§  Wikidata - pulling citation data to enrich their pages
§  OpenAIRE – using OCC bibliographic resources info in OpenAIRE
§  LOC-DB - have adopted the OpenCitations data model for their database
§  Tomas Petricek of the Turing Institute - extending his Gamma Project
visualization software to handle OpenCitations’ RDF data
§  Ontotext.com - combining Springer's SciGraph data with OpenCitations
data using SPARQL federation
§  Anna Kamińska of the Polish Librarians Association - undertaking citation
network analysis of PLoS One research papers using data in the OCC
n  We can’t know who else is using OpenCitations resources unless they tell us!
§  Please let us know if you are!
n  On 10th September, Crossref blogged about our use of their REST API
§  https://www.crossref.org/blog/using-the-crossref-rest-api.-part-5-with-opencitations/

Present status of OpenCitations
n  We have recently received a small
grant from the Sloan Foundation for the
OpenCitations Enhancement Project
§  This provides one year’s salary
for a postdoc to develop new user
interfaces, and new hardware to
enhance the OCC performance
n  We have just appointed Ivan Heibi to
work on the OCC with Silvio in Bologna
n  Silvio and Ivan will be commissioning
the new hardware next month
§  This will use parallel processing
to increase ingest rate 30-fold
n  We are in the process of appointing an
International Advisory Board to guide
the growth of OpenCitations

Enhancing the OpenCitations ingestion rate
n  OpenCitations currently ingests ~8 million new citations per year
n  With 30 Raspberry Pis working in parallel as ingest machines, we anticipate
that this rate will increase to ~240 million new citations per year
n  By the end of 2018, OpenCitations should hold ~250 million citations,
compared to Web of Knowledge’s ~1.25 billion
n  Even this partial coverage will include citations of all important papers,
these critical papers being easily recognized because they are highly cited,
forming nodes in the citation graph with a large number of inward citation links
n  A further five-fold increase in ingest rate - significant but achievable with
additional hardware (and funding!) - will enable us to reach parity by 2020
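The arithmetic behind these projections can be checked in a few lines. The figures are those quoted in the slides; the simplifying assumptions (one full year of ingest at the parallel rate, perfectly linear scaling, and a Web of Knowledge total that stays fixed) are ours for the sake of illustration.

```python
CURRENT_RATE = 8_000_000           # citations ingested per year today
MACHINES = 30                      # Raspberry Pis working in parallel
HELD_NOW = 10_000_000              # ">10 million citation links" already in the OCC
WEB_OF_KNOWLEDGE = 1_250_000_000   # approximate comparison figure

# Assumes ingest scales linearly with the number of machines.
parallel_rate = CURRENT_RATE * MACHINES

# Assumes one full year at the parallel rate before the end of 2018.
held_end_2018 = HELD_NOW + parallel_rate

# A further five-fold increase in ingest rate.
boosted_rate = parallel_rate * 5

# Years needed after 2018 to match a static Web of Knowledge total.
years_to_parity = (WEB_OF_KNOWLEDGE - held_end_2018) / boosted_rate

print(parallel_rate)
print(held_end_2018)
print(boosted_rate)
print(round(years_to_parity, 2))
```

Under these assumptions the remaining gap closes in under a year at the boosted rate, which is consistent with the stated aim of parity by 2020.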

Where will the references come from?
n  With the enhanced ingest rate, we will quickly consume all 1.7 million articles
in the Open Access Subset of PubMed Central
n  We will then start harvesting the references from the ~16 million articles
already made open at Crossref in response to the Initiative for Open Citations,
and the additional articles that I4OC now encourages other publishers to open
n  Possible additional significant sources of open citation data include
§  ArXiv (1.3 million preprints)
§  CiteSeerX (>120 million references from >6 million documents)
§  CitEc (11 million references from a million Economics papers)
n  References from pre-digital publications extracted by text mining, e.g.
§  In the Social Sciences, from the LOC-DB at the University of Mannheim
§  In Biological Taxonomy, mined into BioStor by Rod Page from the
Biodiversity Heritage Library, e.g. http://biostor.org/reference/105357

We are winning the battle for open scholarship!
OpenCitations:
David Shotton: david.shotton@opencitations.net
Silvio Peroni: silvio.peroni@opencitations.net
Website: http://opencitations.net
Email: contact@opencitations.net
Twitter: @opencitations
Blog: https://opencitations.wordpress.com
Initiative for Open Citations (I4OC):
Dario Taraborelli: dtaraborelli@wikimedia.org
Mark Patterson: m.patterson@elifesciences.org
Catriona MacCallum: catriona.maccallum@hindawi.com
Website: https://i4oc.org/
Email: info@i4oc.org
Twitter: @i4oc_org
