
Pulverer-embo-source data-nfdp13

Presentation by Bernd Pulverer on EMBO's 'Source Data' and the next generation of open access, given at the Now and Future of Data Publishing Symposium, 22 May 2013, Oxford, UK.

Published in: Education, Technology

  1. EMBO SourceData: Next Gen Open Access. Bernd Pulverer, Chief Editor | The EMBO Journal, Head | Scientific Publications
  2. Data transparency
  3. Scientific publishing
     • Dominant channel for the dissemination of peer-reviewed data.
     • Journals function as a proxy for quality in research assessment.
     • The rate of publishing keeps increasing.
     • Papers are human-readable but poorly machine-readable.
  4. The Research Paper: Title, Abstract, Synopsis, Main paper, Supp Info, Datasets
  5. The Research Paper: Title, Abstract, Synopsis, Main paper, Supp Info, Datasets, Expert View
  6. ‘Expert View’
     • All the data required to support the conclusions included in the paper.
     • ‘General reader’ vs. ‘expert’ view of the paper:
       o Expandable/collapsible ‘inline’ sections
       o Copy edited
     • Restricted to select types of data and information:
       o Replicates
       o Controls, experimental optimization
       o ‘Negative’ results
       o Extended experimental protocols
       o Computational algorithms
     • Datasets presented as separate files.
     • No further-reaching data
  7. Title, Abstract, Synopsis, Main paper, Expert View, Datasets, Source data
  8. What is a figure? A scientific result converted into a collection of pixels.
  9. Discoverable, rich content. ‘[…]n seeing all the data […]nt lever that we have for transparency’ Michael F[…]
  10. SourceData: tools to publish figures as structured digital objects that link the human-readable illustrations with machine-readable metadata and ‘source data’ in order to
      • improve data transparency (ethics)
      • make published data (re)usable
      • enable data-oriented search
  11. SourceData
      • Data: figure source data files hosted by the journals; link to data repositories.
      • Metadata: focus on the biological content; use standard identifiers and existing controlled vocabularies.
      • Search: data-oriented semantic search of the literature; overcome some of the limitations of keyword-based search.
  12. • Archive • Transparency • Revisualization • Reuse • Integration • Search • Discourage manipulation
      o voluntary
      o ~40% of papers
  13. [No text on this slide.]
  14. Data Transparency [figure labels: No, No, Yes, Yes]
  15. SourceData
      • Data: figure source data files hosted by the journals; link to data repositories.
      • Metadata: focus on the biological content; use standard identifiers and existing controlled vocabularies.
      • Search: data-oriented semantic search of the literature; overcome some of the limitations of keyword-based search.
  16. Structured metadata: ‘perturbation-observation-assay’
      1. ‘Object-oriented’ representation of experimental variables: list biological components.
      2. Retain the causality of the experimental design: “Measurement of Y as a function of A, B, C, using assay P in biological system S.”
      3. Machine-readable representation with standard identifiers.
      [Figure labels: measured component, perturbed component, experimental system, assayed property]
  17. Data copy editors
  18. SourceData
      • Data: figure source data files hosted by the journals; link to data repositories.
      • Metadata: focus on the biological content; use standard identifiers and existing controlled vocabularies.
      • Search: data-oriented semantic search of the literature; overcome some of the limitations of keyword-based search.
  19. Data-oriented search
  20. Data-oriented search [diagram labels: Paper1 (drug Z, kinase Y, activity); Paper2 (kinase Y, P, protein X); Paper3 (gene x, disease D, tissue T)]. Resulting hypothesis: test drug Z in disease D.
  21. Data-oriented search. Query: […] More-like-this: […] [figure labels: CREB, forskolin, time]
  22. ‘Next Generation’ Open Access: Data, Search, Metadata
      sdAnnotations:annotationID a sdCore:PerturbationMeasurmentExp;
          :linkedToPanel sdPanels:panelID;
          :hasVariable sdVariables:variable1;
          :hasVariable sdVariables:variable2;
          :usingBiologicalSystem sdBiolSystem:biolSystemNode;
          :basedOnSourcedataset sdSourceDatasets:dsID .
  23. [No text on this slide.]
  24. Raw, rare, well done...?
  25. From raw to processed data
  26. A data ‘ecosystem’ [diagram labels: Reader, Author, SourceData, Journals, Data repositories; paper, data, data access, search]
  27. Distributed infrastructure [diagram labels: Database, Journals, Users, Research data]
  28. [Figure labels: Smad3, Hey1, TGFbeta, VE-cdh, Rad51 foci, AR, Tsc2; Rad51, nuclear complexes; TGFb, Smad3]
  29. Literature search engines: PubMed 72%; Google 17%; Europe PMC <2%
  30. Data are published in papers
  31. ‘Publishing’ papers; ‘Depositing’ datasets
  32. Availability of published data and software
      • Datasets obtained by experimentation, computation or data mining should be made freely available, without restriction.
      • Software should be described in sufficient detail to allow reproduction. If a specific implementation is the focus of the study, free access for non-commercial users is strongly recommended.
      • Deposition of data should preferably be in one of the public databases prior to submission.
  33. Data deposition: Large-scale datasets, sequences, atomic coordinates and computational models should be deposited in one of the relevant public databases prior to submission (provided private access is available at the database), and authors should include accession codes in the Materials & Methods section.
  34. Big
  35. Public databases
      • Structural data: PDB, NDB, EMDataBank
      • Functional genomics: GEO, ArrayExpress
      • Proteomics: PRIDE, PeptideAtlas, PASSEL
      • PPI: IMEx consortium
      • Clinical genomics datasets: EGA, dbGaP
      • Metagenomics: GenBank
      • Computational models: BioModels, JWS
  36. search
  37. SourceData
      • Data: figure source data files hosted by the journals; link to ‘unstructured data’ repositories.
      • Metadata: focus on the biological content; use standard identifiers and existing controlled vocabularies.
      • Search: data-oriented semantic search of the literature; overcome some of the limitations of keyword-based search.
  38. [No text on this slide.]
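
Slide 16 describes figure metadata as a ‘perturbation-observation-assay’ record that keeps the causality of the experiment. The sketch below is one possible way to hold such a record in Python; it is not the SourceData schema. The class names `Component` and `PanelAnnotation`, their fields, and the `example:` identifiers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    """A biological component named in a figure panel (hypothetical schema)."""
    name: str
    identifier: str  # placeholder standing in for a standard identifier


@dataclass
class PanelAnnotation:
    """One figure panel as a perturbation-observation-assay record (assumed layout)."""
    measured: List[Component]   # components that were observed
    perturbed: List[Component]  # components that were experimentally varied
    assay: str                  # assay used for the measurement
    system: str                 # biological system the experiment was run in

    def describe(self) -> str:
        """Render the record in the sentence form quoted on slide 16."""
        measured = ", ".join(c.name for c in self.measured)
        perturbed = ", ".join(c.name for c in self.perturbed)
        return (f"Measurement of {measured} as a function of {perturbed}, "
                f"using assay {self.assay} in biological system {self.system}.")


# Illustrative values only; the identifiers are made up.
panel = PanelAnnotation(
    measured=[Component("Y", "example:Y")],
    perturbed=[Component("A", "example:A"),
               Component("B", "example:B"),
               Component("C", "example:C")],
    assay="P",
    system="S",
)
print(panel.describe())
```

Keeping the measured and perturbed components in separate fields is what preserves the direction of the experiment, which a flat keyword list would lose.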

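
Slide 20 illustrates data-oriented search by combining results from three papers into the hypothesis "test drug Z in disease D". A minimal sketch of that idea follows, assuming each paper contributes one (perturbed component, measured component) pair. The pairings in `RECORDS` are an assumed reading of the slide's diagram, and `find_chain` is a hypothetical helper, not part of any SourceData tool.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# (paper, perturbed component, measured component)
Record = Tuple[str, str, str]

# Assumed reading of the slide 20 diagram; illustrative records only.
RECORDS: List[Record] = [
    ("Paper1", "drug Z", "kinase Y"),
    ("Paper2", "kinase Y", "protein X"),
    ("Paper3", "protein X", "disease D"),
]


def find_chain(records: List[Record], start: str, goal: str) -> Optional[List[Record]]:
    """Breadth-first search for a chain of records linking start to goal."""
    # Index records by the component that was perturbed.
    edges: Dict[str, List[Record]] = {}
    for rec in records:
        edges.setdefault(rec[1], []).append(rec)

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rec in edges.get(node, []):
            measured = rec[2]
            if measured not in seen:
                seen.add(measured)
                queue.append((measured, path + [rec]))
    return None  # no chain of experiments connects the two components


chain = find_chain(RECORDS, "drug Z", "disease D")
if chain:
    for paper, perturbed, measured in chain:
        print(f"{paper}: {perturbed} -> {measured}")
```

A keyword search would return each paper separately; following perturbed-to-measured links across papers is what surfaces the indirect connection.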