Pulverer-embo-source data-nfdp13
Presentation by Bernd Pulverer on EMBO's 'Source Data' and the next generation of open access given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK

Upload Details

Uploaded as Microsoft PowerPoint

Usage Rights

CC Attribution License

  • A transparent process provides a permissive environment for the publication of ethically robust papers by releasing some of the pressure in the race to publish in biology.
  • I would like to present some initiatives and ideas we have with regard to published data. They represent an extension of the concept of transparency to the data we publish in our journals, but also an extension of the concept of open access. Several of these ideas are currently being developed in a project called SourceData, which I will briefly summarize.
  • Data are the heart of a paper: free text is author interpretation; data are absolute. We think it is important to consider how data are presented in figures. But publishing faces many challenges. One of these is that the rate of publishing keeps increasing: about one million papers are now indexed every year, twice as many as ten years ago, and some journals such as PLOS ONE are in an exponential growth phase. It is thus becoming harder and harder to search the literature and find specific information. While fewer and fewer people manage to keep up with this mass of human-readable papers, we rely more and more on machines to access these documents, which are, however, poorly machine-readable.
  • Deconstruct a paper: it is a stacked, layered structure that allows access to the content with increasing depth, from the title and abstract down to the data. The title and abstract provide quick access for the browsing reader; synopses and visual abstracts provide summaries of key facts. The core is the main paper, which is optimized for the human reader. At a deeper level sit the supplementary information, structured datasets and computer code.
  • What we would like to achieve in the near future is to eliminate the concept of supplementary information: the volume of these supplementary sections is continuously growing; they are not well reviewed, not copy edited and often not well presented; and they sometimes contain data only peripherally related to the main conclusions.
  • Instead of supplementary information, we propose an expert view of the paper. Some data can be repetitive and make papers difficult to read. We therefore propose two views of the paper: one for general readers, which corresponds to the main paper as we know it now, and an expert view, where additional in-depth information and data are included within the paper as expandable/collapsible sections.
  • Similarly, to encourage maximal use/reuse, we will make all the datasets and source data freely available under a CC0 license by default.
  • Data are the core of a research paper, yet figures are published as images, that is, as collections of pixels, making it impossible to re-analyse, re-use or find the data easily. This affects all journals, whether they are open access or not, and means that most published scientific data remains locked inside the papers.
  • To start addressing this challenge, we have started the SourceData project. With SourceData we want to publish figures as structured digital objects by linking figures with source data and machine-readable metadata, in order to improve data transparency, promote data sharing and enable data-oriented search.
  • There are three components. The first step of the project is to enable authors to provide the raw data behind the figures. This can be done in several ways: authors can either host the data with the journals, and I will show an example in a minute, or elect to host the data in one of the several 'unstructured data repositories' such as Dryad and provide links to these resources.
  • This is not limited to numerical data. Here is an example from The EMBO Journal where full gels are provided as uncropped images, allowing readers to examine the blots beyond the narrow slices usually displayed in published figures. The same applies to micrographs.
  • Datasets alone are, however, of limited utility: we need to associate these data with structured metadata that explain their biological content. To be useful for data mining, this must be done in a machine-readable way, which involves the use of standard identifiers, controlled vocabularies and existing ontologies.
  • The second level will encode a fundamental structure common to many biomedical experiments. This is most easily seen for data represented as a plot: such data result from an experiment in which a given biological component Y was measured or observed as a function of various experimental conditions or perturbations A, B, C, using an assay P in a defined experimental system S. This separation between components that are observed (for example the phosphorylation level of a protein) and components that are perturbed (for example a kinase inhibited by a drug) can be applied across an extremely broad range of published data, whether western blots, histological preparations or microscopy, because the model captures the causality underlying an experimental design. This representation of directional relationships between biological components provides a backbone model on which much detail can be elaborated; it is scalable, in the sense that it can be extended and refined, and specialized models can be derived from it (a minimal sketch of such an encoding is shown after this list).
  • So we are currently developing tools that will enable the curation of accepted manuscripts by data editors and embed this curation in the production process.
  • Finally, the third component is to use the machine-readable metadata to enable data-oriented searches of papers based on the data they contain. The semantic information provided by the metadata will help overcome some of the limitations of text-based searches. SourceData will make figures more usable; the search will make them discoverable.
  • SourceData will allow papers to be searched through their data. If, for example, we are interested in finding data about CDK1 substrates, we can formulate a more or less complex query in PubMed. In this case, we would find a series of papers; to check their relevance we would have to open them, read the abstracts and check the articles and figures. If the figures had been annotated with SourceData metadata, we could instead search directly for published experiments in which measurements were made under conditions where CDK1 activity was perturbed (see the query sketch after this list). This would lead us straight to the relevant data inside the papers, from which we can link out to the associated articles.
  • As a consequence, related experiments can be found across papers in the literature and joined in a directional way to help generate hypotheses. In this example, drug Z might be interesting to test for disease D. It would be extremely difficult to perform such tasks systematically with conventional search strategies. This is an application that goes beyond mere search and is a step towards the integration of multiple datasets. It will potentially be an extremely powerful feature for generating new hypotheses and, potentially, new findings.
  • Another function is to find figures that are closely related to each other: starting from a given figure, it would be possible to find related figures and the respective papers. This resembles the 'related articles' function in PubMed, but applied to individual figure panels.
  • To conclude, the last ten years have seen profound changes in scientific publishing: the transition to online publishing and open access content has opened the door to large-scale, systematic mining of the literature. This transition, however, needs to be completed by going beyond access to the text and offering deeper access to the research data. The current format of the human-readable paper will remain, but the paper of the future will need to be associated with a machine-readable version. With SourceData, we will make published data usable by linking them to explanatory machine-readable metadata. These metadata will in turn enable data-oriented search functions that will increase the discoverability of the papers. This represents the next generation of open access, enabling much deeper access to the literature and the systematic mining and integration of published data across it. Such a transformation will be needed to realize the potential of research data to generate new findings and accelerate scientific discovery.
  • It is very early days to predict how the data ecosystem will stabilize, both technically and economically. From the users' point of view, the basic tasks have to remain as simple and straightforward as possible: authors want to submit their papers and data, and readers need to access the data and search the literature. The role of SourceData in this ecosystem is to provide a series of tools and services that will create a win-win-win situation across the major stakeholders: authors will benefit from the increased discoverability of their research; journals and data repositories will increase the visibility of their content and add more value to it, which is a crucial issue in publishing at the moment; and readers will have greater and deeper access to data and to the literature.
  • 50 panels of TGF-beta signaling data, annotated in a rudimentary way.
  • The data published in papers are mainly presented in figures. It is in the figures that the evidence supporting the conclusions is shown; figures are absolutely essential to a formal scientific proof.
  • The fact that these large datasets do not fit the classical format of published papers, especially when print was still relevant, has created a situation in which papers and data largely live parallel, separate lives: papers are published in journals, datasets are deposited in databases. This has perhaps conditioned us to think of scientific publishing and data dissemination in separate terms.
  • The importance of making research data available has been largely driven by the fields in biology that produce large-scale datasets: genomics and the other omics fields, but also structural biology.
  • This has serious consequences for search, which has become essential to find specific information in this ocean of papers.
  • The first step of the project is to enable authors to provide the raw data behind the figures; we call these source data, and this gave the entire project its name. This can be done in several ways: either host the data on the publisher's website, and I will show an example in a minute, or elect to host the data in one of the several 'unstructured data repositories' such as Dryad and provide links to these resources. Datasets alone are, however, of limited utility. The second component, which is central to the proposal presented today, is to associate these data with structured metadata that explain their biological content. To be useful for global data mining, this must be done in a machine-readable way, which involves the use of standard identifiers, controlled vocabularies and existing ontologies. Finally, the third component is to use the machine-readable metadata to enable data-oriented searches of papers based on the data they contain. The semantic information provided by the metadata will help overcome some of the limitations of text-based searches: the metadata will make the data more usable, and the search will make them discoverable. So, let us briefly review these three components.
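
To make the 'perturbation-observation-assay' backbone concrete, here is a minimal sketch, in Python with rdflib, of how the metadata for a single figure panel might be encoded. The namespace, property names, example identifiers and the placeholder dataset DOI are illustrative assumptions, not the actual SourceData schema; they only mirror the structure described above (perturbed component, measured component, assayed property, assay, experimental system, and a link to the underlying source data file).

    # Hypothetical sketch: 'perturbation-observation-assay' metadata for one figure panel,
    # expressed as RDF with rdflib. The sd: vocabulary and property names are illustrative.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    SD = Namespace("http://example.org/sourcedata/")        # hypothetical vocabulary
    UP = Namespace("http://purl.uniprot.org/uniprot/")      # standard protein identifiers

    g = Graph()
    g.bind("sd", SD)
    g.bind("up", UP)

    panel = URIRef("http://example.org/papers/12345#fig2a")      # hypothetical panel ID
    expt = URIRef("http://example.org/papers/12345#fig2a-exp1")  # the experiment shown in it

    g.add((expt, RDF.type, SD.PerturbationMeasurementExp))
    g.add((expt, SD.linkedToPanel, panel))
    g.add((expt, SD.perturbedComponent, UP["P06493"]))    # CDK1, e.g. inhibited by a drug
    g.add((expt, SD.measuredComponent, UP["P02545"]))     # lamin A/C, shown for illustration
    g.add((expt, SD.assayedProperty, Literal("phosphorylation")))
    g.add((expt, SD.assay, Literal("western blot")))
    g.add((expt, SD.experimentalSystem, Literal("HeLa cells")))
    # Link to the underlying source data file (placeholder DOI; could point to a journal
    # site or a repository such as Dryad).
    g.add((expt, SD.basedOnSourceDataset, URIRef("https://doi.org/10.5061/dryad.xxxx")))

    print(g.serialize(format="turtle"))

Because the perturbed and measured components carry standard identifiers rather than free text, the same triples can later be queried, merged with annotations from other papers, or mapped onto richer ontologies.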
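
In the same hypothetical vein, the data-oriented search described above could be expressed as a SPARQL query over such annotations. The sketch below, again using rdflib, retrieves panels reporting measurements made while CDK1 was perturbed; the file name and vocabulary are assumptions carried over from the previous sketch.

    # Hypothetical data-oriented query: find figure panels in which CDK1 (UniProt P06493)
    # was perturbed, and report what was measured and with which assay.
    from rdflib import Graph

    g = Graph()
    g.parse("sourcedata_annotations.ttl", format="turtle")   # assumed export of panel metadata

    QUERY = """
    PREFIX sd: <http://example.org/sourcedata/>
    PREFIX up: <http://purl.uniprot.org/uniprot/>

    SELECT ?panel ?measured ?assay
    WHERE {
      ?expt a sd:PerturbationMeasurementExp ;
            sd:linkedToPanel ?panel ;
            sd:perturbedComponent up:P06493 ;
            sd:measuredComponent ?measured ;
            sd:assay ?assay .
    }
    """

    for row in g.query(QUERY):
        print(row.panel, row.measured, row.assay)

Chaining such queries across papers (drug Z perturbs kinase Y in one paper, kinase Y phosphorylates protein X in another, protein X is linked to disease D in a third) is what turns data-oriented search into the hypothesis-generation tool described above.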

Presentation Transcript

  • EMBO SourceData – Next Gen Open Access. Bernd Pulverer, Chief Editor | The EMBO Journal; Head | Scientific Publications
  • Data transparency
  • Scientific publishing – Dominant channel for the dissemination of peer-reviewed data. – Journals function as a proxy for quality in research assessment. – The rate of publishing keeps increasing. – Papers are human-readable but poorly machine-readable. 5/27
  • Title Abstract Synopsis Main paper Supp Info Datasets The Research Paper
  • Title Abstract Synopsis Main paper Supp Info Datasets Expert View The Research Paper
  • ‘Expert View’ • All the data required to support the conclusions included in the paper. • ‘General reader’ vs. ‘expert’ view of the paper: – Expandable/collapsible ‘inline’ sections, – Copy edited. • Restricted to select types of data and information: – Replicates – Controls, experimental optimization – ‘Negative’ results – Extended experimental protocols – Computational algorithms • Datasets presented as separate files. • No further reaching data 6
  • Title Abstract Synopsis Main paper Expert View Datasets Source data
  • What is a figure? A scientific result converted into a collection of pixels 8/27
  • Discoverable, rich content. '… seeing all the data … lever that we have for transparency' (Michael F.)
  • SourceData – Tools to publish figures as structured digital objects that link the human-readable illustrations with machine-readable metadata and 'source data' in order to • improve data transparency (ethics) • make published data (re)usable • enable data-oriented search 9/27
  • SourceData – Data: figure source data files hosted by the journals; link to data repositories. Metadata: focus on the biological content; use standard identifiers and existing controlled vocabularies. Search: data-oriented semantic search of the literature; overcome some of the limitations of keyword-based search. 10/27
  • Archive • Transparency • Revisualization • Reuse • Integration • Search • Discourage manipulation – voluntary – ~40% papers
  • Data transparency [figure: example panels labelled 'No' and 'Yes']
  • SourceData – Data: figure source data files hosted by the journals; link to data repositories. Metadata: focus on the biological content; use standard identifiers and existing controlled vocabularies. Search: data-oriented semantic search of the literature; overcome some of the limitations of keyword-based search. 10/27
  • Structured metadata: 'perturbation-observation-assay'. 1. 'Object-oriented' representation of experimental variables: list biological components. 2. Retain the causality of the experimental design: "Measurement of Y as a function of A, B, C, using assay P in biological system S." 3. Machine-readable representation with standard identifiers. [diagram: measured component, perturbed component, experimental system, assayed property] 15/27
  • Data copy editors 18
  • SourceData – Data: figure source data files hosted by the journals; link to data repositories. Metadata: focus on the biological content; use standard identifiers and existing controlled vocabularies. Search: data-oriented semantic search of the literature; overcome some of the limitations of keyword-based search. 10/27
  • Data-oriented search
  • Data-oriented search. Resulting hypothesis: test drug Z in disease D. [diagram: Paper 1 – drug Z and kinase Y activity; Paper 2 – kinase Y and phospho-protein X; Paper 3 – gene X, disease D, tissue T] 19/27
  • Data-oriented search. Query: a panel annotated with CREB and forskolin. More-like-this: related panels (CREB/forskolin, CREB/time). 17/27
  • RDF (Turtle) annotation linking a figure panel to its experimental variables, biological system and source dataset:
    sdAnnotations:annotationID a sdCore:PerturbationMeasurmentExp ;
        :linkedToPanel sdPanels:panelID ;
        :hasVariable sdVariables:variable1 ;
        :hasVariable sdVariables:variable2 ;
        :usingBiologicalSystem sdBiolSystem:biolSystemNode ;
        :basedOnSourcedataset sdSourceDatasets:dsID .
    'Next Generation' Open Access: Data | Metadata | Search
  • Raw, rare, well done...?
  • From raw to processed data
  • A data 'ecosystem' [diagram: authors submit papers and data; journals, data repositories and SourceData provide data access and search for readers] 26/27
  • Distributed infrastructure [diagram: users, journals, databases, research data]
  • [demo screenshot: annotated figure panels involving Smad3, Hey1, TGFbeta, VE-cdh, Rad51 foci, AR and Tsc2; examples: Rad51 nuclear complexes; TGFb and Smad3]
  • Literature search engines: PubMed 72%, Google 17%, Europe PMC <2%
  • Data are published in papers 7/27
  • ‘Publishing’ papers ‘Depositing’ datasets
  • Availability of published data and software • Datasets obtained by experimentation, computation or data mining should be made freely available, without restriction. • Software should be described in sufficient detail to allow reproduction. If a specific implementation is the focus of the study, free access for non-commercial users is strongly recommended. • Deposition of data should preferably be in one of the public databases prior to submission.
  • Data deposition Large-scale datasets, sequences, atomic coordinates and computational models should be deposited in one of the relevant public databases prior to submission (provided private access is available at the database) and authors should include accession codes in the Materials & Methods section.
  • Big
  • Public databases Structural data PDB, NDB, EMDataBank Functional genomics GEO, ArrayExpress Proteomics Pride, PeptideAtlas, PASSEL PPI IMEx consortium Clinical genomics datasets EGA, dbGAP Metagenomics Genbank Computational models BioModels, JWS
  • search
  • SourceData – Data: figure source data files hosted by the journals; link to 'unstructured data' repositories. Metadata: focus on the biological content; use standard identifiers and existing controlled vocabularies. Search: data-oriented semantic search of the literature; overcome some of the limitations of keyword-based search. 10/27