I am a cell biologist with 35 years of undergraduate teaching experience. Five years ago, I gave up laboratory work entirely, swapping making videos of lymphocytes killing virally infected cells for research data management, metadata models and ontologies. My wife thinks I’m crazy. We have already heard today about the many benefits of publishing research data openly for re-use, and I will not rehearse those points. Rather, I want to talk to you this afternoon about why researchers at present don’t manage their research data well, and how we can improve the situation, drawing on the activities and experience of my own research group in the Department of Zoology in Oxford. So, I want to exemplify tools and systems that permit local data management, enable data storage in repositories, assist in the creation of rich metadata to accompany and describe datasets, and enable data citation. All the work I will be describing has been made possible through JISC funding of four projects on which I am Principal Investigator.
I have illustrated tools and systems we have been working on that facilitate managing, publishing, describing and citing data. These are all open source, and freely available for use. In particular, I would encourage Pro-Vice Chancellors for Research, IT managers and similar institutional decision-makers to contact me if you would like to explore the use of DataStage for your research groups’ local data management requirements, and DataBank as a solution to your institutional data repository requirements.
David Shotton - Research Integrity: Integrity of the published record
Why don’t researchers publish data? <ul><li>Three pressures presently prevent researchers from publishing their data </li></ul><ul><li>Information overload and pressure of work </li></ul><ul><ul><li>With twenty new papers each week, a researcher can never catch up – there is just too much new scientific information being produced now </li></ul></ul><ul><ul><li>Have to run to stand still – no time for ‘fringe’ activities like data curation </li></ul></ul><ul><li>Departmental pressure for financial viability, determined by the REF </li></ul><ul><ul><li>pressure to win grants and to publish in high impact journals </li></ul></ul><ul><ul><li>negligible incentives and academic reward in terms of peer esteem, tenure or promotion for data publication activities </li></ul></ul><ul><li>Cognitive overhead and skill barriers to best-practice data management </li></ul><ul><ul><li>metadata concepts foreign to most biomedical researchers </li></ul></ul><ul><ul><li>large amount of effort involved in preparing data for publication </li></ul></ul><ul><li>[From evidence submitted 5 August 2011 to the Royal Society’s Science as a Public Enterprise policy study] </li></ul>
1 Managing data <ul><li>In the JISC ADMIRAL Project (A Data Management Infrastructure for Research Across the Life Sciences), we developed a two-tier data management system </li></ul><ul><li>Locally, researchers save files to a secure private DataStage file store </li></ul><ul><ul><li>This is for their own benefit (file management, regular backup, controlled access, Web interface, etc.) </li></ul></ul><ul><ul><li>and does not pose a cognitive overhead – “sheer curation” </li></ul></ul><ul><li>We then provide a Web interface that permits researchers to select and package datasets for publication and long-term repository archiving </li></ul><ul><ul><li>Easy to do, when the researcher is ready, with minimal metadata </li></ul></ul><ul><li>Finally, data can be published to the Oxford DataBank institutional repository </li></ul><ul><ul><li>Run by the Bodleian Library, with a track record in preservation </li></ul></ul><ul><ul><li>Easy for the researcher to upload a revised dataset if required </li></ul></ul><ul><ul><li>Optional embargo period to permit prior journal article publication </li></ul></ul><ul><ul><li>Data packages assigned DOIs, making them citable (for academic credit) </li></ul></ul>
2 Publishing data on the cloud <ul><li>In the new JISC UMF DataFlow Project, we are now adapting DataStage and DataBank for third-party research groups and institutions to deploy and use </li></ul><ul><ul><li>to run on the Eduserv Academic Cloud (or on another cloud, or locally) </li></ul></ul><ul><ul><li>hardened as professional VMware virtual machine software appliances </li></ul></ul><ul><ul><li>installation designed to be easy and customizable (e.g. your name &amp; logo) </li></ul></ul><ul><ul><li>enabling institutions to provide their members with zero cost data management solutions (apart from cloud hosting costs) </li></ul></ul><ul><ul><ul><li>cloud provision can expand and shrink with requirements </li></ul></ul></ul><ul><ul><ul><li>no need to build and staff your own local data centre </li></ul></ul></ul><ul><li>We have just set up alpha versions on a local cloud in Oxford, for test use </li></ul><ul><ul><li>We welcome interest from potential test users </li></ul></ul><ul><ul><li>(University of Leeds has just installed and is testing its own DataBank) </li></ul></ul><ul><li>We will have beta versions by Christmas and production versions by Q2 2012 </li></ul>http://www.dataflow.ox.ac.uk/
The JISC DRYAD-UK Project http://datadryad.org/ <ul><li>An alternative to an institutional repository is a subject-specific repository </li></ul><ul><li>Dryad is a repository for datasets underlying biomedical scientific research articles </li></ul><ul><li>Its initial focus was on evolution and ecology </li></ul><ul><li>Dryad was originally developed at the University of North Carolina, with funding from the National Science Foundation </li></ul><ul><li>The JISC Dryad-UK Project has been working over the past year </li></ul><ul><ul><li>to mirror the Dryad data repository at the British Library </li></ul></ul><ul><ul><li>to add new journals, expanding into other areas of biology and medicine, particularly infectious disease – 25 added to date, with more coming </li></ul></ul><ul><ul><li>to plan and facilitate Dryad’s financial sustainability from journal fees </li></ul></ul><ul><li>Once an article has been published, Dryad publishes the related datasets, using metadata provided by the journal publisher </li></ul><ul><li>Data packages are assigned DOIs and published with Creative Commons CC-Zero open data licenses, to enable free re-use of the datasets </li></ul>
3 Describing data <ul><li>At present, DataBank and Dryad impose minimal metadata requirements </li></ul><ul><li>The DataCite mandatory metadata properties required for DOI assignment: </li></ul><ul><ul><li>Identifier (the DOI) </li></ul></ul><ul><ul><li>Creator (i.e. authors) </li></ul></ul><ul><ul><li>Title </li></ul></ul><ul><ul><li>Publisher (i.e. repository name, “Dryad Data Repository”) </li></ul></ul><ul><ul><li>Publication Year </li></ul></ul><ul><li>As part of the JISC Dryad-UK Project, we set out to investigate whether we could enable the creation of richer metadata without too much effort, providing better data descriptions that would assist discovery and reuse </li></ul><ul><li>We also wanted to enable the publication of such DataBank and Dryad metadata as Open Linked Data, encoded in RDF, the machine-readable data description language used on the Web </li></ul><ul><li>The particular focus for our enhanced metadata is infectious disease data </li></ul>
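The minimal requirement listed on this slide can be checked mechanically. The following is a minimal sketch, assuming a dataset description held as a plain Python dictionary; the key names (`identifier`, `creator`, and so on) and the function `missing_mandatory` are illustrative labels chosen here, not element names from the DataCite schema or from DataBank or Dryad code.

```python
# The five mandatory properties named on the slide.
# The snake_case key names are an assumption made for this sketch.
MANDATORY_PROPERTIES = (
    "identifier",
    "creator",
    "title",
    "publisher",
    "publication_year",
)


def missing_mandatory(record):
    """Return the mandatory properties that are absent or empty in `record`."""
    missing = []
    for name in MANDATORY_PROPERTIES:
        value = record.get(name)
        # Treat None, empty strings and empty lists as "not supplied".
        if value is None or value == "" or value == []:
            missing.append(name)
    return missing


# Example record built from the Dryad dataset cited later in this talk.
example = {
    "identifier": "doi:10.5061/dryad.8684",
    "creator": ["Vijendravarma RK", "Narasimha S", "Kawecki TJ"],
    "title": (
        "Data from: Plastic and evolutionary responses of cell size and "
        "number to larval malnutrition in Drosophila melanogaster"
    ),
    "publisher": "Dryad Digital Repository",
    "publication_year": 2011,
}

print(missing_mandatory(example))          # complete record
print(missing_mandatory({"title": "x"}))   # incomplete record
```

A record that passes this check carries only the minimum needed for DOI assignment; it says nothing yet about the richer, domain-specific description discussed on the following slides.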
Enhancing metadata – the Reis et al. (2008) exemplar http://dx.doi.org/10.1371/journal.pntd.0000228.x001
Rhetorical metadata in the Study Summary <ul><li>The problem with this summary is that </li></ul><ul><ul><li>it is hand-crafted by a single individual </li></ul></ul><ul><ul><li>it is not backed by any recognised metadata standard </li></ul></ul><ul><ul><li>it is only human-readable, lacking an ontology-based machine-readable RDF representation </li></ul></ul>
MIIDI http://www.miidi.org/ <ul><li>MIIDI is a Minimal Information standard for an Infectious Disease Investigation </li></ul><ul><li>An international MIIDI workshop in September 2009 led to an initial draft </li></ul><ul><li>In January 2011, Tanya Gray started work with me to develop MIIDI properly </li></ul><ul><li>She has now developed MIIDI into a validated XML data model, and has created MIIDI Forms that permit easy metadata entry conforming to the MIIDI standard </li></ul><ul><li>To permit encoding of MIIDI metadata terms in RDF, we have mapped them to appropriate ontologies, including IDO, the Infectious Disease Ontology </li></ul><ul><li>The MIIDI standard can be used to create rich metadata both for journal articles and for datasets, such as those held in DataBank or Dryad repositories </li></ul><ul><li>The methodology is generic, and we hope to see it adopted for use in combination with other metadata standards, e.g. those under the umbrella of MIBBI - Minimum Information for Biological and Biomedical Investigations </li></ul>
‘Disease’ section of the MIIDI Report for Reis et al. 2008
4 Citing data <ul><li>At present, published datasets are poorly cited in the scientific literature </li></ul><ul><li>A survey of PLoS journal articles related to Dryad datasets showed that </li></ul><ul><ul><li>most papers lacked any reference to Dryad, and </li></ul></ul><ul><ul><li>the others had only unstructured citations within the body text, e.g. </li></ul></ul><ul><ul><ul><li>“A selection of the 30,000 structures is represented in Fig. 1 and a repository, with their all-atom configuration, is available at http://dx.doi.org/10.5061/dryad.1922.” </li></ul></ul></ul><ul><ul><ul><li>“Raw microsatellite data generated in this study have been deposited in the Dryad database (http://www.datadryad.org) under accession number 1540.” </li></ul></ul></ul><ul><ul><ul><li>“Initiatives such as Dryad (http://datadryad.org/repo) (where the data in this study are published) should mean that literature data become easier to gather and maintain in the future.” </li></ul></ul></ul><ul><li>None of the papers had a proper data reference in the reference list </li></ul>
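The three quoted sentences show why unstructured in-text mentions are hard to harvest automatically. As a minimal sketch, assuming Dryad DOIs always take the form `10.5061/dryad.<suffix>` seen in the first quotation, a regular expression recovers the dataset from that sentence but finds nothing usable in the other two. The pattern and the function name `find_dryad_dois` are illustrative, not part of the survey described above.

```python
import re

# Assumed pattern: the Dryad DOI prefix followed by an alphanumeric suffix.
DRYAD_DOI = re.compile(r"10\.5061/dryad\.[A-Za-z0-9]+")


def find_dryad_dois(text):
    """Return every Dryad-style DOI found in a piece of body text."""
    return DRYAD_DOI.findall(text)


# The three in-text mentions quoted on the slide.
sentences = [
    "A selection of the 30,000 structures is represented in Fig. 1 and a "
    "repository, with their all-atom configuration, is available at "
    "http://dx.doi.org/10.5061/dryad.1922.",
    "Raw microsatellite data generated in this study have been deposited in "
    "the Dryad database (http://www.datadryad.org) under accession number 1540.",
    "Initiatives such as Dryad (http://datadryad.org/repo) (where the data in "
    "this study are published) should mean that literature data become easier "
    "to gather and maintain in the future.",
]

for sentence in sentences:
    print(find_dryad_dois(sentence))
```

Only the first sentence yields an identifier; the second gives a bare accession number and the third gives only the repository home page, so neither can be resolved to a dataset without manual work.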
Best practice for the citation of Dryad datasets <ul><li>I have proposed best practice for citing datasets, available in a discussion paper at http://bit.ly/lt7VsM, recommending: </li></ul><ul><li>That the citation style for referencing on-line data should be as similar as possible to that used for referencing scholarly articles </li></ul><ul><ul><li>Creator (PublicationYear) Title. Publisher. Identifier. </li></ul></ul><ul><li>That the preferred data identifier to be used is a Digital Object Identifier or, if that is not available, the unique accession number or identifier used by the data repository or database in which the data resides </li></ul><ul><li>That this reference be included in the paper’s reference list </li></ul><ul><li>That this data reference in the reference list should be denoted by an appropriate in-text citation, including an in-text reference pointer </li></ul>
Example of best practice for the citation of a Dryad dataset <ul><li>Example in-text citation and in-text reference pointer: "The raw data underpinning this analysis are deposited in the Dryad Data Repository at http://dx.doi.org/10.5061/dryad.8684 (Vijendravarma et al., 2011)." </li></ul><ul><li>Example data reference in the article’s reference list: Vijendravarma RK, Narasimha S, Kawecki TJ (2011). Data from: Plastic and evolutionary responses of cell size and number to larval malnutrition in Drosophila melanogaster. Dryad Digital Repository. doi:10.5061/dryad.8684. </li></ul><ul><li>These recommendations have been adopted in the Data Publishing Policies and Guidelines for Biodiversity Data of the publisher Pensoft, available at </li></ul><ul><ul><li>http://www.pensoft.net/J_FILES/Pensoft_Data_Publishing_Policies_and_Guidelines.pdf </li></ul></ul>
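The recommended pattern, Creator (PublicationYear) Title. Publisher. Identifier., is regular enough to generate from the mandatory metadata. The sketch below assumes the metadata are available as simple Python values; the function name `format_data_citation` and its parameters are hypothetical and are not taken from any Dryad or DataBank tool.

```python
def format_data_citation(creators, year, title, publisher, identifier):
    """Build a reference-list entry in the pattern
    'Creator (PublicationYear) Title. Publisher. Identifier.'"""
    creator_text = ", ".join(creators)
    return f"{creator_text} ({year}) {title}. {publisher}. {identifier}."


# Metadata for the Dryad dataset used as the example on this slide.
reference = format_data_citation(
    creators=["Vijendravarma RK", "Narasimha S", "Kawecki TJ"],
    year=2011,
    title=(
        "Data from: Plastic and evolutionary responses of cell size and "
        "number to larval malnutrition in Drosophila melanogaster"
    ),
    publisher="Dryad Digital Repository",
    identifier="doi:10.5061/dryad.8684",
)
print(reference)
```

The output follows the template literally, so it omits the full stop after the year that appears in the hand-written example; a journal's house style would decide such details.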
The JISC Open Citations Project - publishing bibliographic and data citations as Linked Open Data <ul><li>The problem </li></ul><ul><ul><li>Citation data are hard to find, locked in the reference lists of copyright articles </li></ul></ul><ul><li>Scope, vision and aim of the Open Citations Project </li></ul><ul><ul><li>The Open Citations Project is global in scope, designed to change the face of scientific publishing and scholarly communication </li></ul></ul><ul><ul><li>Its vision is to publish citation data openly as Linked Open Data </li></ul></ul><ul><ul><li>It aims to make citation links as easy to traverse as Web links </li></ul></ul><ul><li>Potential benefits of Open Citations </li></ul><ul><ul><li>Cited works are more easily discovered </li></ul></ul><ul><ul><li>Citation networks can be explored to study the growth of knowledge </li></ul></ul><ul><ul><li>The most cited papers – nodes with high degree (Barabási) – are clearly exposed </li></ul></ul><ul><ul><li>Distortions in knowledge caused by mis-citation can be identified </li></ul></ul>
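Once citation links are openly available, finding the most cited papers amounts to counting incoming links per node. The sketch below assumes citations are available as simple (citing, cited) pairs; the identifiers such as `paper:A` are made up for illustration and the function `most_cited` is not part of the Open Citations software.

```python
from collections import Counter


def most_cited(citations, top_n=3):
    """Rank cited works by in-degree, i.e. the number of citing works.

    `citations` is an iterable of (citing_id, cited_id) pairs.
    """
    in_degree = Counter(cited for _citing, cited in citations)
    return in_degree.most_common(top_n)


# Hypothetical citation links, invented for this example.
links = [
    ("paper:A", "paper:X"),
    ("paper:B", "paper:X"),
    ("paper:C", "paper:X"),
    ("paper:A", "paper:Y"),
    ("paper:B", "paper:Y"),
    ("paper:C", "paper:Z"),
]

print(most_cited(links, top_n=2))
```

The same pairs could equally be traversed in the other direction to list everything a given paper cites, which is the sense in which citation links become as easy to follow as Web links.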
The Open Citations Corpus <ul><li>The reference lists extracted from all 204,637 articles in the Open Access Subset of PMC (as of 24 January 2011), each encoded as a Named Graph </li></ul><ul><li>These reference lists contain 6,325,178 individual references, some unique, but many from different citing articles to the same highly cited papers </li></ul><ul><li>These refer to 3,373,961 unique papers outside the Open Access Subset </li></ul><ul><ul><li>~20% of all PubMed Central papers published between 1950 and 2010 </li></ul></ul><ul><ul><li>includes ALL the highly cited papers in every biomedical field </li></ul></ul><ul><li>Data freely available under a CC0 waiver from http://opencitations.net/data/ </li></ul><ul><li>We would now like to expand the corpus to include data citations, e.g. </li></ul><ul><ul><li>references to journal articles from Dryad data packages </li></ul></ul><ul><ul><li>the inferable reciprocal references from these articles to Dryad </li></ul></ul>
Viewing citation networks at http://opencitations.net
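Using the citation data - Open Research Reports
Top Papers for Open Research Reports
<table>
<tr><th rowspan="2">Disease name</th><th rowspan="2">Number of papers cited</th><th colspan="4">PubMed IDs of 20 most highly cited papers (with number of times cited)</th></tr>
<tr><th>1</th><th>2</th><th>3</th><th>4</th></tr>
<tr><td>Cholera</td><td>1,993</td><td>10952301 (47)</td><td>15242645 (44)</td><td>2836362 (25)</td><td>16432199 (24)</td></tr>
<tr><td>Dengue fever</td><td>3,858</td><td>17510324 (44)</td><td>9665979 (42)</td><td>1372617 (34)</td><td>15577938 (32)</td></tr>
<tr><td>HIV/AIDS</td><td>54,432</td><td>9516219 (122)</td><td>12167863 (101)</td><td>9539414 (86)</td><td>12742798 (83)</td></tr>
<tr><td>Leprosy</td><td>1,147</td><td>11234002 (70)</td><td>17604718 (18)</td><td>15894530 (13)</td><td>12901893 (12)</td></tr>
<tr><td>Leptospirosis</td><td>940</td><td>11292640 (47)</td><td>14652202 (37)</td><td>12712204 (27)</td><td>15028702 (26)</td></tr>
<tr><td>Malaria</td><td>25,290</td><td>12368864 (230)</td><td>12364791 (146)</td><td>781840 (134)</td><td>12893887 (101)</td></tr>
<tr><td>Measles</td><td>1,719</td><td>11742391 (22)</td><td>16262740 (19)</td><td>15798843 (18)</td><td>8974392 (13)</td></tr>
<tr><td>Pneumonia</td><td>6,901</td><td>8995086 (60)</td><td>15699079 (53)</td><td>11463916 (49)</td><td>10524952 (47)</td></tr>
<tr><td>Schistosomiasis</td><td>3,036</td><td>15866310 (49)</td><td>12973350 (46)</td><td>16790382 (43)</td><td>4675644 (40)</td></tr>
<tr><td>Trypanosomiasis</td><td>5,864</td><td>16020726 (108)</td><td>16020725 (75)</td><td>10215027 (57)</td><td>43092 (35)</td></tr>
<tr><td>Tuberculosis</td><td>16,091</td><td>9634230 (117)</td><td>9157152 (83)</td><td>12742798 (83)</td><td>8381814 (80)</td></tr>
<tr><td>Amyotrophic lateral sclerosis</td><td>2,380</td><td>8446170 (46)</td><td>17023659 (32)</td><td>11386269 (22)</td><td>15217349 (22)</td></tr>
<tr><td>Spinal muscular atrophy</td><td>555</td><td>7813012 (28)</td><td>10339583 (20)</td><td>11925564 (20)</td><td>9074884 (15)</td></tr>
<tr><td>Total excluding ALS and SMA</td><td>121,271</td><td colspan="4"></td></tr>
<tr><td>Total</td><td>124,206</td><td colspan="4"></td></tr>
<tr><td>Average</td><td>9,554</td><td colspan="4"></td></tr>
</table>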
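The summary rows of this table can be reproduced from the per-disease counts. The following is a minimal sketch that simply re-enters the "Number of papers cited" column by hand; the variable names are illustrative, and the average is assumed to be taken over all thirteen diseases, which is what matches the figure shown.

```python
# "Number of papers cited" per disease, copied from the table.
papers_cited = {
    "Cholera": 1993,
    "Dengue fever": 3858,
    "HIV/AIDS": 54432,
    "Leprosy": 1147,
    "Leptospirosis": 940,
    "Malaria": 25290,
    "Measles": 1719,
    "Pneumonia": 6901,
    "Schistosomiasis": 3036,
    "Trypanosomiasis": 5864,
    "Tuberculosis": 16091,
    "Amyotrophic lateral sclerosis": 2380,
    "Spinal muscular atrophy": 555,
}

# The two conditions the table leaves out of its first total (ALS and SMA).
EXCLUDED = {"Amyotrophic lateral sclerosis", "Spinal muscular atrophy"}

total = sum(papers_cited.values())
total_excluding = sum(
    count for disease, count in papers_cited.items() if disease not in EXCLUDED
)
average = total / len(papers_cited)

print(total_excluding)   # total excluding ALS and SMA
print(total)             # total over all thirteen rows
print(round(average))    # average per disease, rounded to a whole number
```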
End . . . with thanks to the JISC for funding over recent years, and acknowledgement of the excellent work of my colleagues who have contributed to the following JISC projects: <ul><li>ADMIRAL / DataFlow: Graham Klyne, Diana Galletly, Bhavana Ananda, Anusha Ranganathan, Sally Rumsey, Neil Jeffreys (Bodleian Library) </li></ul><ul><li>Open Citations: Ben O’Steen and Alex Dutton </li></ul><ul><li>Dryad-UK: Tanya Gray (MIIDI), Silvio Peroni (SPAR ontologies), Brian Hole (British Library) </li></ul>e-mail: email@example.com
Why publish research datasets in central repositories? <ul><li>It is widely recognised that the research results from publicly funded research projects should be made publicly available </li></ul><ul><li>Publishing research data should simply be seen as an extension of the publication process for research papers </li></ul><ul><li>Centralized subject-specific repositories like Dryad, with streamlined curation processes, are highly cost-effective, in comparison with each journal taking on a massive expansion of its own Supplementary Materials capabilities </li></ul><ul><li>In addition, the data will be less fragmented, openly accessible, easily searched and interoperable </li></ul><ul><li>Imagine what a mess we would be in now if all our DNA sequence data were scattered among the Supplementary Materials holdings of different journals! </li></ul>
How should such data publishing be funded? <ul><li>Research funding agencies should pay for startup and R&D costs </li></ul><ul><li>The primary beneficiaries (i.e. the scientific community) should sustain the ongoing operating costs of preserving their own research data </li></ul><ul><ul><li>using the same economic model (author charges, society funds, subscriptions, etc.) that funds the associated journal articles </li></ul></ul><ul><li>Centralized research data repositories benefit from the same economies of scale that publishers enjoy in operating multiple journals </li></ul><ul><li>The average total publishing and distribution costs per article amount to about £4,000 </li></ul><ul><ul><li>(RIN 2008 report: Activities, Costs and Funding Flows in the Scholarly Communications System) </li></ul></ul><ul><li>The Dryad business model is for each participating journal to pay a fee of ~$50 per paper, ~1% of the total cost of publishing the article </li></ul>
Conversion of hypothesis to ‘fact’ by citation alone <ul><li>Citation: </li></ul><ul><li>Steven Greenberg (2009). </li></ul><ul><li>How citation distortions create unfounded authority: analysis of a citation network. </li></ul><ul><li>British Medical Journal 339: b2680. </li></ul>
Clustering of CiTO relationships by similarity <ul><li>Rhetorical </li></ul><ul><ul><li>Positive: Agrees with, Confirms, Credits, Supports </li></ul></ul><ul><ul><li>Neutral: Cites, Cites as related, Discusses, Reviews, Extends </li></ul></ul><ul><ul><li>Negative: Corrects, Qualifies, Disagrees with, Disputes, Refutes, Critiques, Parodies, Ridicules </li></ul></ul><ul><li>Factual </li></ul><ul><ul><li>Cites as authority, Cites as evidence, Obtains background from, Obtains support from, Contains assertion from, Uses data from, Uses method in, Cites as data source, Cites for information, Documents, Updates, Includes excerpt from, Includes quotation from, Plagiarizes, Cites as metadata document, Cites as source document, Shares authors with </li></ul></ul>
SPAR – Semantic Publishing and Referencing Ontologies http://purl.org/spar/