
The Initiative for Open Citations and the OpenCitations Corpus


Slides of David Shotton's presentation at OASPA 2017, 20 September 2017, Lisbon, Portugal. These slides describe the Initiative for Open Citations, a collaboration between scholarly publishers, researchers, and other interested parties to promote the unrestricted availability of scholarly citation data, and OpenCitations, a small infrastructure organization which hosts and develops the OpenCitations Corpus (OCC), a Linked Open Data repository of scholarly bibliographic citation data.

Published in: Science

  1. 1. The Initiative for Open Citations and the OpenCitations Corpus. David Shotton (david.shotton@opencitations.net), Oxford e-Research Centre, University of Oxford, UK. 9th Conference on Open Access Scholarly Publishing, Lisbon, Portugal, 20 Sept 2017. © David Shotton 2017. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence.
  2. 2. 2013: “Free scholarly citation data!” Fifth Conference on Open Access Scholarly Publishing, Riga, Latvia, 20 September 2013. Image: Donatello’s John the Baptist, “. . . the voice of one crying in the wilderness”
  3. 3. 2016: “Release open citation data!” Eighth Conference on Open Access Scholarly Publishing, Virginia, USA, 20 September 2016. Dario Taraborelli, Head of Research, Wikimedia Foundation
  4. 4. 2017: The year of success - citation data is freed!
     • Two fantastic success stories
         • The Initiative for Open Citations https://i4oc.org/
         • The OpenCitations Corpus http://opencitations.net
     • While related, these initiatives are separate and distinct
     • Two Italian heroes: Dario Taraborelli and Silvio Peroni
  5. 5. Crossref - providing the fundamental infrastructure https://www.crossref.org/
     • Crossref is the registration agency of Digital Object Identifiers (DOIs) for scholarly publications (journal articles). Most publishers are members
     • Crossref holds metadata about articles, made available via its REST API https://www.crossref.org/services/metadata-delivery/rest-api/
     • Crossref has its own heroes:
         • Ed Pentz, Executive Director
         • Geoff Bilder, Director of Strategic Initiatives
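As a minimal sketch of how such metadata might be retrieved, the following Python builds a request for a single work from the Crossref REST API and extracts the DOIs from its reference list. It assumes the public `https://api.crossref.org/works/{DOI}` route and a response whose `message.reference` array carries a `DOI` key for matched references. The function names and the `mailto` contact address are illustrative and not taken from the slides.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/"  # assumed public REST API base


def work_url(doi: str) -> str:
    """Build the REST API URL for one DOI, percent-encoding the DOI."""
    return CROSSREF_API + urllib.parse.quote(doi, safe="")


def cited_dois(work_json: dict) -> list:
    """Return the DOIs present in a work's reference list (empty if closed)."""
    references = work_json.get("message", {}).get("reference", [])
    return [ref["DOI"] for ref in references if "DOI" in ref]


def fetch_work(doi: str, mailto: str = "you@example.org") -> dict:
    """Fetch one work record. Network call; mailto is a hypothetical contact."""
    request = urllib.request.Request(
        work_url(doi),
        headers={"User-Agent": f"open-citations-sketch (mailto:{mailto})"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `cited_dois(fetch_work(some_doi))` would then list the cited DOIs for an article whose publisher has opened its references; for a closed reference list the result is simply empty.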
  6. 6. The Initiative for Open Citations
     • The Initiative for Open Citations is a collaboration between scholarly publishers, researchers, and other interested parties to promote the unrestricted availability of scholarly citation data
     • It does not host citation data!
     • Launched April 6, 2017. Web site https://i4oc.org
     • Spearheaded by Dario Taraborelli of the Wikimedia Foundation
         • with help from Jonathan Dugan, Martin Fenner, Jan Gerlach, Catriona MacCallum, Daniel Mietchen, Cameron Neylon, Mark Patterson, Michelle Paulson, Silvio Peroni and myself
     • Six founding organizations:
         • The Wikimedia Foundation, PLOS, eLife, DataCite, OpenCitations, and the Centre for Culture and Technology at Curtin University
     • Within a short space of time, I4OC has persuaded most of the major scholarly publishers to make their reference lists open, so that the proportion of all references submitted to Crossref that are now open has risen from 1% to over 45%!
  7. 7. Publishers supporting I4OC and opening their references
     • 49 scholarly publishers have opened their references, including the following major ones:
     • Commercial publishers
         • Association for Computing Machinery, BMJ, De Gruyter, eLife, EMBO Press, Hindawi, IOS Press, PeerJ, Pensoft Publishers, Portland Press, Public Library of Science, Springer Nature, Taylor & Francis, Wiley
     • University and scholarly presses
         • Cambridge University Press, Cold Spring Harbor Laboratory Press, Company of Biologists, Edinburgh University Press, MIT Press, Rockefeller University Press
     • Learned societies
         • American Association for the Advancement of Science (AAAS), American Physical Society, American Society for Cell Biology, International Union of Crystallography, Proceedings of the National Academy of Sciences (PNAS), Royal Society of Chemistry, The Royal Society
  8. 8. Organizations and institutions who have endorsed I4OC
     • Funders
         • Sloan Foundation, Bill and Melinda Gates Foundation, Jisc, Simons Foundations Science Sandbox, Wellcome Trust
     • Research organizations
         • Allen Institute for Artificial Intelligence, Microsoft Research
     • Libraries
         • Association of Research Libraries, British Library, California Digital Library, Harvard Library Office for Scholarly Communication, LIBER, Max Planck Digital Library
     • Bibliographic / bibliometric organizations
         • Altmetrics, CiteSeerX, DBLP Computer Science Bibliography, ImpactStory, Zotero
     • Other organizations
         • Dryad Data Repository, Figshare, Internet Archive, Mozilla, OASPA, Open Knowledge International, OpenAIRE, ScienceOPEN, Wiki Education Foundation, Wikimedia Deutschland, Wikimedia UK
  9. 9. I4OC: what’s left to do
     • Almost 50% of Crossref-deposited references, from ~16 million articles, are now open, leaving about half that are still closed
     • Crossref has over 7000 members, and it’s the long tail of smaller publisher-members that are not presently opening their references
     • This includes a large number of Open Access publishers!
         • Just because an article is published as Open Access and its references are available on the publisher’s web site, this is not sufficient for the bulk harvesting and analysis of citation data
         • Imagine the effort of going to each site in turn and scraping reference lists presented in a wide variety of differing formats and DTD markups!
     • Many small scholarly publishers are not even members of Crossref
     • But help is at hand:
         • OASPA has a sponsored agreement with Crossref whereby its smaller members can join Crossref via OASPA, with OASPA covering the cost of a proportion of their DOIs
  10. 10. How to open references using the Crossref Cited-by service
     • The Crossref Cited-by service is a free service that helps publishers find out who is citing their articles
     • Publishers submit article reference lists to Crossref along with other metadata
     • However, the Crossref default is that these reference lists are closed, not OPEN!
     • To open their article reference lists, a publisher needs to do one of two things:
         • Either contact support@crossref.org and ask them to turn on reference distribution for all the DOI prefixes they manage
         • Or, in the article metadata they submit to Crossref, set the reference_distribution_opt attribute to “any” for each DOI deposit where they want to make references openly available
     • It’s that easy!!!
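The second option can be sketched in Python as follows. This is an illustration under stated assumptions rather than Crossref's own tooling: it assumes that `reference_distribution_opt` is carried as an attribute on each content item element (here `journal_article`) in the deposit XML, and the sample fragment, its DOI, and the function names are hypothetical and heavily truncated compared with a real deposit.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def open_references(deposit_xml: str, item_tag: str = "journal_article") -> str:
    """Set reference_distribution_opt="any" on every matching item element."""
    root = ET.fromstring(deposit_xml)
    for element in root.iter():
        if local_name(element.tag) == item_tag:
            element.set("reference_distribution_opt", "any")
    return ET.tostring(root, encoding="unicode")


# Hypothetical, truncated fragment: a real deposit also has a namespace,
# a head section, titles, contributors and the citation list itself.
SAMPLE = """<journal>
  <journal_article publication_type="full_text">
    <doi_data><doi>10.9999/example.1</doi></doi_data>
  </journal_article>
</journal>"""

print(open_references(SAMPLE))
```

Matching on the local tag name keeps the sketch independent of whichever schema namespace version a given deposit declares.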
  11. 11. ZooKeys use of Crossref open citation data
  12. 12. The OpenCitations Corpus
     • OpenCitations (http://opencitations.net) is a small infrastructure organization directed by myself and Silvio Peroni
     • Its primary purpose is to host and develop the OpenCitations Corpus (OCC), a Linked Open Data repository of scholarly bibliographic citation data
     • A founding member of I4OC, it is distinct and separate from that initiative
     • The first OCC prototype was created at Oxford in 2011 with Jisc funding; see my 2013 COASP talk in Riga (http://zeeba.tv/the-open-citations-corpus/)
     • A new instance of the OCC, based on our revised metadata schema, was created by Silvio Peroni and is now running at the University of Bologna
     • It has been ingesting scholarly references continuously since early July 2016
     • OCC now provides the largest RDF collection of open citation data on the Web
         • Currently holds references from ~240,000 citing bibliographic resources
         • Provides >10 million citation links to over 5.5 million cited resources
         • These data are freely available under a CC0 public domain waiver
  13. 13. Source data - reference lists from PubMed Central
     • At present, the ingested reference lists are obtained by processing the XML sources of papers in the Open Access subset of PubMed Central
     • These are parsed to yield authors, titles, journal names, etc.
         • We ask for the most recent papers first
         • Thus, as citing papers, the OCC mainly includes articles published in 2016 and 2017
     • The identifiers of all the citing papers already processed are stored locally, so as not to request the same XML source twice
     • We then call several external APIs, including Crossref and ORCID, to obtain additional metadata describing the citing and cited papers and their authors
     • There are almost 1.7 million OA articles available in PubMed Central
         • So far we have harvested 14% . . .
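The "store processed identifiers locally" step could look something like the sketch below. It is not the OpenCitations ingest code: the `ProcessedIds` class, the JSON file used for persistence, and the `fetch` callback are all assumptions introduced for illustration.

```python
import json
from pathlib import Path


class ProcessedIds:
    """Persistent set of identifiers whose XML source was already requested."""

    def __init__(self, path):
        self.path = Path(path)
        self.ids = set()
        if self.path.exists():
            self.ids = set(json.loads(self.path.read_text(encoding="utf-8")))

    def is_new(self, identifier: str) -> bool:
        return identifier not in self.ids

    def mark(self, identifier: str) -> None:
        """Record the identifier and write the whole set back to disk."""
        self.ids.add(identifier)
        self.path.write_text(json.dumps(sorted(self.ids)), encoding="utf-8")


def harvest(candidates, store, fetch):
    """Call fetch(pmcid) once per unseen identifier; return what was fetched."""
    fetched = []
    for pmcid in candidates:
        if store.is_new(pmcid):
            fetch(pmcid)       # hypothetical callback that downloads the XML
            store.mark(pmcid)  # persisted so a restart does not refetch it
            fetched.append(pmcid)
    return fetched
```

Rewriting the file on every `mark` is simple but slow at scale; a real harvester handling hundreds of thousands of papers would more plausibly use an append-only log or a small database.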
  14. 14. The raw reference list data
     • The reference lists extracted from citing papers are made available in JSON:
 {
 "doi": "10.1007/s11892-016-0752-4",
 "pmid": "27168063",
 "pmcid": "PMC4863913",
 "localid": "MED-27168063",
 "curator": "BEE EuropeanPubMedCentralProcessor",
 "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
 "source_provider": "Europe PubMed Central",
 "references": [
 ...
 {
 "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
 "pmid": "19461125",
 "doi": "10.1093/intimm/dxp039",
 "pmcid": "PMC2686615",
 "process_entry": "True"
 },
 ...
 ]
 }
     • Slide annotations: the top-level fields are the citing paper's metadata and identifiers; each entry in "references" is a reference in the citing paper's reference list, with its own ids
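To illustrate how a record of this shape yields citation links, here is a small Python sketch. The embedded record is a trimmed copy of the slide's example, with the elided references left out and the `bibentry` text shortened; `citation_links` is a hypothetical helper name, not part of the OpenCitations software.

```python
import json

# Trimmed copy of the slide's record; elided references are omitted and
# the bibentry string is shortened for this illustration.
RAW = """{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "references": [
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome (shortened)",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    }
  ]
}"""


def citation_links(record: dict, id_key: str = "doi") -> list:
    """Return (citing_id, cited_id) pairs for references that carry id_key."""
    citing = record.get(id_key)
    if citing is None:
        return []
    return [
        (citing, ref[id_key])
        for ref in record.get("references", [])
        if ref.get(id_key)
    ]


record = json.loads(RAW)
print(citation_links(record))                 # DOI-to-DOI link
print(citation_links(record, id_key="pmid"))  # PMID-to-PMID link
```

References that lack the chosen identifier are skipped, which is why the corpus also needs the external API lookups described on the previous slide.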
  15. 15. The SPAR (Semantic Publishing and Referencing) Ontologies http://www.sparontologies.net/
     • OCC data are then stored in RDF (JSON-LD) using the SPAR (Semantic Publishing and Referencing) ontologies and other standard vocabularies
     • These SPAR ontologies include
         • FaBiO, the FRBR-aligned Bibliographic Ontology - an ontology for describing bibliographic entities (books, articles, etc.)
         • CiTO, the Citation Typing Ontology - enables the characterization of citations, both factually and rhetorically
         • BiRO, the Bibliographic Reference Ontology - an ontology to define bibliographic records and references, and their compilation into bibliographic collections and reference lists, respectively
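As a rough illustration of what a citation expressed with these vocabularies looks like, the Python below serializes one citing/cited pair using the CiTO property `cito:cites` and the FaBiO class `fabio:JournalArticle`. Note the swap: it emits N-Triples rather than the JSON-LD the corpus stores, purely because N-Triples needs no library. The second resource URI is hypothetical, and the exact shape of the OCC data model is not reproduced here.

```python
CITO = "http://purl.org/spar/cito/"
FABIO = "http://purl.org/spar/fabio/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one IRI-IRI-IRI statement as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def citation_triples(citing: str, cited: str) -> list:
    """Describe two journal articles and the citation link between them."""
    return [
        triple(citing, RDF_TYPE, FABIO + "JournalArticle"),
        triple(cited, RDF_TYPE, FABIO + "JournalArticle"),
        triple(citing, CITO + "cites", cited),
    ]


# br/1 appears on the slides; br/2 is a made-up URI for the cited resource.
for line in citation_triples(
    "https://w3id.org/oc/corpus/br/1",
    "https://w3id.org/oc/corpus/br/2",
):
    print(line)
```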
  16. 16. Availability of the OpenCitations Corpus data
     • All the OpenCitations software is available on GitHub under an open license
     • The data in the OpenCitations Corpus are available in three different ways:
         • Direct access to bibliographic resources by means of their HTTP URIs (via content negotiation), e.g. https://w3id.org/oc/corpus/br/1
         • Queries to our SPARQL endpoint: https://w3id.org/oc/sparql
         • Monthly dumps stored in Figshare: http://opencitations.net/download
     • Currently the OCC uses a good graph-based triplestore, Blazegraph
     • However, the virtual machine that hosts it is very limited in resources, causing performance problems for demanding SPARQL queries
     • We plan soon to commission a new powerful physical server that should provide a better user experience, and to develop additional user-friendly interfaces for accessing the OCC data, including graphic visualizations of citation networks
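The first two access routes can be sketched with the Python standard library as below. The resource URI and endpoint come from the slide; the media types requested, the query itself, and the function names are assumptions for illustration, and nothing here confirms which formats the server actually returns. The functions only build the requests; passing one to `urllib.request.urlopen` would send it.

```python
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://w3id.org/oc/sparql"  # from the slide
RESOURCE = "https://w3id.org/oc/corpus/br/1"    # from the slide

# Illustrative query: ten citing/cited pairs linked by cito:cites.
QUERY = """
PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?citing ?cited
WHERE { ?citing cito:cites ?cited }
LIMIT 10
"""


def resource_request(uri: str, media_type: str = "application/ld+json"):
    """Content negotiation: ask for an RDF format via the Accept header."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def sparql_request(endpoint: str, query: str):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
```

Keeping `LIMIT` small matters here, given the slide's warning that demanding queries strain the current virtual machine.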
  17. 17. Use of the OpenCitations web site
     • Accesses to the OpenCitations web site and services: the “corpus” and “sparql” pages have together gained 89% of the total accesses, showing that people mainly access the OpenCitations Corpus to explore and use the data within it
  18. 18. Use of OpenCitations data stored on Figshare
  19. 19. What happened this summer?
     • Use of the OpenCitations social accounts increased markedly following the launch of the Initiative for Open Citations
         • Twitter - https://twitter.com/opencitations
         • Wordpress Blog - https://opencitations.wordpress.com/
  20. 20. Who is using OpenCitations, and for what?
     • Organizations and projects that we know use OpenCitations resources include:
         • Wikidata - pulling citation data to enrich their pages
         • OpenAIRE - using OCC bibliographic resource information in OpenAIRE
         • LOC-DB - have adopted the OpenCitations data model for their database
         • Tomas Petricek of the Turing Institute - extending his Gamma Project visualization software to handle OpenCitations’ RDF data
         • Ontotext.com - combining Springer's SciGraph data with OpenCitations data using SPARQL federation
         • Anna Kamińska of the Polish Librarians Association - undertaking citation network analysis of PLoS One research papers using data in the OCC
     • We can’t know who else is using OpenCitations resources unless they tell us!
         • Please let us know if you are!
     • On 10th September, Crossref blogged about our use of their REST API
         • https://www.crossref.org/blog/using-the-crossref-rest-api.-part-5-with-opencitations/
  21. 21. Present status of OpenCitations
     • We have recently received a small grant from the Sloan Foundation for the OpenCitations Enhancement Project
         • This provides one year’s salary for a postdoc to develop new user interfaces, and new hardware to enhance the OCC performance
     • We have just appointed Ivan Heibi to work on the OCC with Silvio in Bologna
     • Silvio and Ivan will be commissioning the new hardware next month
         • This will use parallel processing to increase ingest rate 30-fold
     • We are in the process of appointing an International Advisory Board to guide the growth of OpenCitations
  22. 22. Enhancing the OpenCitations ingestion rate
     • OpenCitations currently ingests ~8 million new citations per year
     • With 30 Raspberry Pis working in parallel as ingest machines, we anticipate that this rate will increase to ~240 million new citations per year
     • By the end of 2018, OpenCitations should hold ~250 million citations, compared to Web of Knowledge’s ~1.25 billion
     • Even this partial coverage will include citations of all important papers, these critical papers being easily recognized because they are highly cited, forming nodes in the citation graph with a large number of inward citation links
     • A further five-fold increase in ingest rate - significant but achievable with additional hardware (and funding!) - will enable us to reach parity by 2020
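The arithmetic behind these projections can be checked with a few lines of Python. All input figures are the slide's own; the calculation additionally assumes that ingest scales linearly with the number of machines and that the Web of Knowledge total stays fixed, neither of which the slides claim.

```python
CURRENT_RATE = 8_000_000    # citations ingested per year (slide figure)
PARALLEL_MACHINES = 30      # Raspberry Pi ingest machines (slide figure)
TARGET_2018 = 250_000_000   # projected OCC holdings at the end of 2018
WOK_TOTAL = 1_250_000_000   # Web of Knowledge citations (slide figure)

# Assumes perfectly linear scaling across the parallel ingest machines.
enhanced_rate = CURRENT_RATE * PARALLEL_MACHINES
coverage_2018 = TARGET_2018 / WOK_TOTAL

# The "further five-fold increase" applied to the enhanced rate.
boosted_rate = enhanced_rate * 5
years_to_close_gap = (WOK_TOTAL - TARGET_2018) / boosted_rate

print(enhanced_rate)                 # 240000000
print(coverage_2018)                 # 0.2
print(boosted_rate)                  # 1200000000
print(round(years_to_close_gap, 2))  # 0.83
```

Under those assumptions the corpus would cover about a fifth of the Web of Knowledge total at the end of 2018, and the boosted rate would close the remaining gap in under a year, which is consistent with the stated aim of parity by 2020.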
  23. 23. Where will the references come from?
     • With the enhanced ingest rate, we will quickly consume all 1.7 million articles in the Open Access Subset of PubMed Central
     • We will then start harvesting the references from the ~16 million articles already made open at Crossref in response to the Initiative for Open Citations, and the additional articles that I4OC now encourages other publishers to open
     • Possible additional significant sources of open citation data include
         • ArXiv (1.3 million preprints)
         • CiteSeerX (>120 million references from >6 million documents)
         • CitEc (11 million references from a million Economics papers)
     • References from pre-digital publications extracted by text mining, e.g.
         • In the Social Sciences, from the LOC-DB at the University of Mannheim
         • In Biological Taxonomy, mined into BioStor by Rod Page from the Biodiversity Heritage Library, e.g. http://biostor.org/reference/105357
  24. 24. We are winning the battle for open scholarship!
     • OpenCitations - Website: http://opencitations.net Email: contact@opencitations.net Twitter: @opencitations Blog: https://opencitations.wordpress.com
         • David Shotton david.shotton@opencitations.net
         • Silvio Peroni silvio.peroni@opencitations.net
     • I4OC - Website: https://i4oc.org/ Email: info@i4oc.org Twitter: @i4oc_org
         • Dario Taraborelli dtaraborelli@wikimedia.org
         • Mark Patterson m.patterson@elifesciences.org
         • Catriona MacCallum catriona.maccallum@hindawi.com
