

One year ago we started ingesting citation data from the Open Access literature into the OpenCitations Corpus (OCC), creating an RDF dataset of scholarly citation data that is open to all. In this presentation we introduce the OCC and we discuss its outcomes and uses after the first year of life.



  1. David Shotton (Oxford e-Research Centre, University of Oxford, UK) and Silvio Peroni (Dept. of Computer Science and Engineering, University of Bologna, Italy). WikiCite 2017, Vienna, Austria, 23 May 2017. © David Shotton and Silvio Peroni, 2017. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence.
  2. What is a citation?
       • The performative act of citing a published work that is relevant to the current work, typically made by including a reference in a reference list
     Why are citations important?
       • The act of bibliographic citation is central to scholarly communication: bibliographic references are the links that knit together independent scholarship
       • Citations unify the whole world of scholarship into a giant citation network
       • Citation networks reveal the development of academic disciplines
       • Sir Isaac Newton: “If I have seen a little further, it is by standing on the shoulders of Giants”
  3. How is the present situation imperfect?
       • The present scholarly citation system inadequately exposes the knowledge networks that exist within the scholarly literature
       • Citation data are hidden behind subscription firewalls of commercial companies
       • Academics are not free to use their own citation data as they please
       • In this Open Access age, it is a scandal that reference lists from journal articles, the core elements of the academic data cycle, are not freely available for use by the scholars who created them
       • Citation data now need to be recognized as a part of the Commons: those works that are freely and legally available for sharing
       • To address this issue, we have developed the OpenCitations Corpus
  4. How this came about: 2009 adventures in semantic publishing
  5. The SPAR (Semantic Publishing and Referencing) Ontologies
       • FaBiO, the FRBR-aligned Bibliographic Ontology: an ontology for describing bibliographic entities (books, articles, etc.)
       • CiTO, the Citation Typing Ontology: enables the characterization of citations, both factually and rhetorically
       • BiRO, the Bibliographic Reference Ontology: an ontology to define bibliographic records and references, and their compilation into bibliographic collections and reference lists, respectively
       • C4O, the Citation Counting and Context Characterization Ontology
       • DoCO, the Document Components Ontology
       • PRO, the Publishing Roles Ontology
       • PSO, the Publishing Status Ontology
       • PWO, the Publishing Workflow Ontology
       • . . . and now others
  6. The OpenCitations Corpus
       • The OpenCitations Corpus is a Linked Open Data repository of scholarly bibliographic citation data described using the SPAR ontologies
       • Prototype created at Oxford in 2011 by Alex Dutton with JISC funding
       • A new instantiation created by Silvio Peroni at the University of Bologna in late 2015
           - based on a revised metadata schema, with automated daily ingestion of citations from authoritative sources
       • OCC now provides the largest RDF collection of open citation data on the Web
           - currently holds the references from ~150,000 citing bibliographic resources
           - providing ~6.7 million citation links to over 4 million cited resources
       • These citations are encoded using the SPAR ontologies, and are freely available under a CC0 public domain waiver
       • The OpenCitations Enhancement Project has just been funded by the Sloan Foundation, to enhance ingest rates and provide smart data visualization interfaces
  7. Ingestion workflow
       • We developed several scripts for implementing the ingestion workflow that populates the OpenCitations Corpus
       • All the software is available on the OpenCitations GitHub repository
           - Released as open source code with the ISC License
       • These scripts implement a live and iterative process
       • Why live?
           - It is working while I’m speaking
           - It never sleeps
           - It is like a sentient, relentless, fast zombie. Watch out!
       • Why iterative?
           - The ingestion workflow continuously calls several external APIs to obtain new reference lists and clean metadata of the citing and cited papers
  8. Reference lists from PubMed Central
       • At present, all the reference lists are taken by processing the XML sources of the papers in the PubMed Central Open Access subset
       • We use the Europe PubMed Central API for retrieving the XML sources
           - We ask for the most recent papers first
           - Thus, as citing papers, the OCC mainly includes articles published in 2016 and 2017
       • There are 1.58M OA articles available in PubMed Central, according to their API
           - We have harvested 10% so far . . .
       • The identifiers of all the citing papers that have already been processed by the ingestion workflow are stored locally, so as not to request the same XML source twice
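The bookkeeping described on this slide (newest papers first, never fetch the same XML source twice) can be sketched as follows. This is a minimal illustration, not the OpenCitations scripts themselves: the function names, the one-identifier-per-line file format, and the `(identifier, year)` candidate tuples are all assumptions made for the example, and no call to the Europe PubMed Central API is made.

```python
import tempfile
from pathlib import Path


def load_processed(store: Path) -> set:
    """Read the locally stored identifiers of already processed citing papers."""
    if not store.exists():
        return set()
    lines = store.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_processed(store: Path, identifier: str) -> None:
    """Append one identifier to the local store (hypothetical file format)."""
    with store.open("a", encoding="utf-8") as handle:
        handle.write(identifier + "\n")


def select_unprocessed(candidates, processed):
    """Order candidates newest first and drop those already processed.

    `candidates` is a list of (identifier, publication_year) pairs; this
    shape is an assumption for the sketch, not the real API response.
    """
    newest_first = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [pid for pid, _year in newest_first if pid not in processed]


# Small demonstration with made-up identifiers.
store = Path(tempfile.mkdtemp()) / "processed_ids.txt"
mark_processed(store, "PMC-A")
candidates = [("PMC-A", 2016), ("PMC-B", 2015), ("PMC-C", 2017)]
todo = select_unprocessed(candidates, load_processed(store))
print(todo)
```

Keeping the store as an append-only text file means an interrupted run loses at most the paper being processed at that moment; a database would serve equally well.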
  9. Metadata from Crossref and ORCID
       • The reference lists extracted from citing papers are made available in JSON:

         {
           "doi": "10.1007/s11892-016-0752-4",
           "pmid": "27168063",
           "pmcid": "PMC4863913",
           "localid": "MED-27168063",
           "curator": "BEE EuropeanPubMedCentralProcessor",
           "source": "",
           "source_provider": "Europe PubMed Central",
           "references": [
             {
               "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
               "pmid": "19461125",
               "doi": "10.1093/intimm/dxp039",
               "pmcid": "PMC2686615",
               "process_entry": "True"
             }
           ]
         }

         The top-level fields hold the citing paper's metadata and identifiers; each item in "references" is a reference in the citing paper's reference list, with its own ids.
       • We then call the Crossref APIs to obtain additional information (title, authors, venues, etc.) about the citing paper and about those papers described in the reference list, and then call the ORCID APIs to obtain ORCIDs of the authors
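As a rough illustration of the step from the JSON record to the Crossref lookups, the sketch below collects the DOIs of the citing paper and of its references and builds one Crossref REST API URL per DOI. The helper names are hypothetical, the record is abridged from the slide, and the sketch only constructs the URLs rather than sending requests; how the actual ingestion scripts batch or order these calls is not stated in the slides. The ORCID step is not covered.

```python
import json
from urllib.parse import quote

# Public Crossref REST API route for a single work, addressed by DOI.
CROSSREF_WORKS = "https://api.crossref.org/works/"

# Abridged copy of the record shown on the slide.
record = json.loads("""
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "source_provider": "Europe PubMed Central",
  "references": [
    {
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "process_entry": "True"
    }
  ]
}
""")


def dois_to_look_up(rec):
    """Return the citing paper's DOI followed by the DOIs of its references."""
    dois = []
    if rec.get("doi"):
        dois.append(rec["doi"])
    for reference in rec.get("references", []):
        if reference.get("doi"):
            dois.append(reference["doi"])
    return dois


def crossref_url(doi):
    """Build the Crossref URL for one DOI, escaping unsafe characters."""
    return CROSSREF_WORKS + quote(doi)


urls = [crossref_url(doi) for doi in dois_to_look_up(record)]
for url in urls:
    print(url)
```

References that carry no DOI are simply skipped here; matching them would need a text search on the "bibentry" string instead.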
  10. The OpenCitations Corpus data model
       • Available online
       • Implemented in the OpenCitations Ontology (OCO)
           - It is not yet another bibliographic ontology, but rather simply a mechanism for grouping together existing complementary ontological entities from several other ontologies (e.g. SPAR and FOAF)
  11. Resources included within the Corpus (as of 26 April 2017)

      | Entity type                 | What it describes                                                                              | Count in the OCC |
      |-----------------------------|------------------------------------------------------------------------------------------------|------------------|
      | Bibliographic resource (br) | Conference papers, book chapters, journal articles, academic proceedings, books, journals, etc. | 5.1 million      |
      | Resource embodiment (re)    | Digital vs. print, first and ending pages, etc.                                                | 2.9 million      |
      | Bibliographic entry (be)    | Textual content of a reference in a reference list                                             | 6 million        |
      | Responsible agent (ra)      | Given name, family name and ORCID of the agent involved                                        | 15.8 million     |
      | Agent role (ar)             | Author, publisher, etc.                                                                        | 20 million       |
      | Identifier (id)             | DOI, PubMed ID, PubMed Central ID, ORCID, ISSN, etc.                                           | 10.4 million     |
  12. OpenCitations in the wild
       • Twitter: @opencitations
       • Blog (on WordPress)
       • The data in the OpenCitations Corpus are available in three different ways:
           - Direct access to bibliographic resources by means of their HTTP URIs (via content negotiation)
           - SPARQL endpoint
           - Monthly dumps (stored in Figshare)
      [Figure: Figshare statistics as of 8 May 2017]
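The first two access routes can be sketched with the Python standard library. Everything specific here is an assumption: the two `example.org` addresses are placeholders (the real URIs are not preserved in this transcript), `text/turtle` is just one RDF serialization a server might offer, and the query assumes citation links are expressed with CiTO's `cito:cites` property. The requests are only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholders, NOT the real OpenCitations addresses.
RESOURCE_URI = "https://example.org/corpus/br/1"
SPARQL_ENDPOINT = "https://example.org/sparql"


def rdf_request(uri, media_type="text/turtle"):
    """Content negotiation: ask for an RDF serialization via the Accept header."""
    return Request(uri, headers={"Accept": media_type})


# Which resources does the given bibliographic resource cite?
QUERY = """
PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?cited WHERE { <%s> cito:cites ?cited } LIMIT 10
""" % RESOURCE_URI


def sparql_request(endpoint, query):
    """SPARQL Protocol query via GET, with results requested as JSON."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


resource_req = rdf_request(RESOURCE_URI)
query_req = sparql_request(SPARQL_ENDPOINT, QUERY)
print(resource_req.get_header("Accept"))
print(query_req.full_url)
```

Passing either request object to `urllib.request.urlopen` would perform the call once real addresses are substituted.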
  13. Third-party usage of OpenCitations
       • Projects that use OpenCitations resources:
           - Wikidata
           - OpenAIRE
           - LOC-DB
           - Others? Please let us know!
       • Accesses to the OpenCitations website and services: the pages relating to the data available (“corpus”) and the service for querying them (“sparql”) together account for 88% of all accesses, showing that the main reason people access the OpenCitations website is to explore and use the data in the OpenCitations Corpus
  14. What happened in the past month
       • Use of the OpenCitations social accounts (Twitter, blog on WordPress) increased markedly during the past month
       • What happened?
  15. Initiative for Open Citations (I4OC)
       • The Initiative for Open Citations (I4OC) is a collaboration between scholarly publishers, researchers, and other interested parties to promote the unrestricted availability of scholarly citation data
       • Founders: OpenCitations is one of the founders
       • Aim: promote the availability of structured, separable, and open citation data
       • How: asking publishers
           - to submit article metadata (including reference lists) to the Crossref Cited-by service
           - to allow Crossref to open the reference lists to the public
       • Achievement: as of March 2017, the share of publications in Crossref whose references are openly available has grown from 1% to more than 40%
  16. The OpenCitations ingestion rate: an update
       • About 500,000 new citation links added per day, rather than per month, with the new infrastructure
       • New infrastructure coming soon (thanks to the OpenCitations Enhancement Project just funded by the Sloan Foundation)
       • The OpenCitations Corpus will have ~190 million citation links after one year of processing with the new infrastructure
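The ~190 million projection can be checked with simple arithmetic, under two assumptions that the slides do not state outright: that the rate under the new infrastructure is about 500,000 links per day, and that the starting point is the ~6.7 million links reported earlier in the talk.

```python
# Assumed inputs, taken from figures quoted in the slides.
CURRENT_LINKS = 6_700_000   # citation links already in the Corpus
LINKS_PER_DAY = 500_000     # assumed daily rate with the new infrastructure
DAYS_IN_YEAR = 365

# One year of daily ingestion on top of the existing links.
projected = CURRENT_LINKS + LINKS_PER_DAY * DAYS_IN_YEAR
print(f"{projected / 1e6:.1f} million citation links")
```

Under these assumptions the total comes to 189.2 million, which is consistent with the rounded figure on the slide; a monthly rate of 500,000 would add only 6 million links in a year.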
  17. Thank you for your attention
      Contacts: David Shotton and Silvio Peroni. Twitter: @opencitations