The PDB An Exemplar for Data Science To Date, But What About the Future?

Keynote Presented at 3DSIG Boston MA, USA July 12, 2014.

Published in: Education, Technology

Transcript of "The PDB An Exemplar for Data Science To Date, But What About the Future?"

  1. The PDB An Exemplar for Data Science To Date, But What About the Future?
     Philip E. Bourne Ph.D.
     Associate Director for Data Science
     National Institutes of Health
  2. Background
     6/12, 2/14, 3/14
     • Findings:
       – Sharing data & software through catalogs
       – Support methods and applications development
       – Need more training
       – Hire CSIO
       – Continued support throughout the lifecycle
     http://acd.od.nih.gov/diwg.htm
  3. Motivation for This Talk
     Source: Michael Bell, http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
  4. More Motivation
  5. The way we fund and operate biomedical databases will not scale. How do we keep the best features of today's resources but also respond to shrinking budgets and changes in the way we do science? Let's address this question using the PDB as an example.
  6. Disclaimer: This is NOT a talk about the PDB per se. It is a talk about data resources in general, using the PDB as an example since we are all familiar with it and it is considered an exemplar by most stakeholders.
  7. Good News: We Trust the PDB
     Trust in the data is perhaps the PDB’s biggest achievement
  8. Good News: Trust
     • Trust is like compound interest
     • Comes from listening
     • Comes from engaging the community in every aspect of the process
     • Comes from data consistency and level of annotation
     • Comes from responsiveness
     • Comes from the quality of the delivery service
  9. Good News/Bad News Re Data Quality
     • Good News:
       – If done right in the beginning 25% of the PDB’s budget could have been saved
       – Ontologies can work
       – Automation has reduced cost even as the amount of data has increased
       – Reproducibility is improved
     • Bad News:
       – Complex ontologies slow adoption
       – All data are created equal
       – Annotation is limited
  10. Good News/Bad News Re Community
     • Good News:
       – The community is engaged
       – The community has driven data sharing
     • Bad News:
       – The community does not reduce costs through active participation
       – There is insufficient reward for being part of the community, e.g. as an annotator
  11. How we do science is changing. Do data resources including the PDB best serve the needs of the user at this point?
  12. How is Science Changing?
     • More interdisciplinary
     • More translational
     • More access to diverse data types
     • More computational
     • More collaborative
  13. Good News/Bad News for the PDB in this Changing Landscape
     • Bad News:
       – Interface complex and uni-data oriented
       – Data accessible; methods accessible (sort of); but not together
       – Significant redundancy in services offered
     • Good News:
       – Annotation!
       – Demand is increasing
       – Integrated with other data types
       – RESTful services
  14. General Problem Statement: How to ensure a high quality annotated data source that provides the optimal environment for accessibility and analysis by a broad community of diverse users?
  15. Okay, so what can the funders do to address a situation where really the PDB is currently a best case scenario?
  16. 1. Encourage more understanding for how existing data are used
     [Figure: Summary page activity for H1N1 Influenza related structures, Jan. 2008 to Jul. 2010. Credit: Andreas Prlic]
     • 1RUZ: 1918 H1 Hemagglutinin Structure
     • 3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir
     * http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
  17. We Need to Learn from Industries Whose Livelihood Addresses the Question of Use
  18. 2. Address the Issue that Scholarship is Broken
     • I have a paper with 17,500 citations that no one has ever read
     • I have papers in PLOS ONE that have more citations than ones in PNAS
     • I have data sets I am proud of but few places to put them
     • I edited a journal but it did not count for much
  19. 3. Address the Reward System
  20. 4. Enable Reproducibility
     • Much of the research life cycle is now digital: encourage the reliability, accessibility, findability, usability of data, methods, narrative, publications etc.
     • How?
     • Data sharing plans
     • Standards frameworks
     • Data and software catalogs
     • PubMedCentral
     ? The Commons – PMC for the complete lifecycle
     ? Machine readable data sharing plans
     ? Small funding to communities
     ? Support for training and best practices in eScholarship
  21. 5. Establish The Commons
     • Public/private partnership
     • Work with ICs, NCBI and CIT to identify and run pilots – cloud, HPC centers
     • Port dbGaP to the cloud
     ? Experiment with new funding strategies
     • Evaluate
  22. Sustainability and Sharing: The Commons
     [Diagram; labels recovered from the slide:]
     • Data: The Long Tail; Core Facilities/HS Centers; Clinical/Patient
     • The Why: Data Sharing Plans
     • The How: Data Discovery Index; Sustainable Storage; Quality; Scientific Discovery; Usability; Security/Privacy
     • Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment
     • The End Game: Knowledge
     • Other labels: The Commons; Government; NIH Awardees; Private Sector; Metrics/Standards; Rest of Academia; Software Standards Index; BD2K Centers; Cloud, Research Objects
  23. What Does the Commons Enable?
     • Dropbox-like storage
     • The opportunity to apply quality metrics
     • Bring compute to the data
     • A place to collaborate
     • A place to discover
     http://100plus.com/wp-content/uploads/Data-Commons-3-1024x825.png
  24. The PDB in the Commons
     • Components:
       – Annotated collection of data files
       – APIs to access these data files
       – Example methods using these APIs
     • Potential outcomes
       – Nothing happens?
       – A new breed of developer starts to use PDB data in new ways?
       – The casual user has a broader set of services than previously?
       – Quality declines?
  25. Some Acknowledgements
     • Eric Green & Mark Guyer (NHGRI)
     • Jennie Larkin (NHLBI)
     • Leigh Finnegan (NHGRI)
     • Vivien Bonazzi (NHGRI)
     • Michelle Dunn (NCI)
     • Mike Huerta (NLM)
     • David Lipman (NLM)
     • Jim Ostell (NLM)
     • Andrea Norris (CIT)
     • Peter Lyster (NIGMS)
     • All the over 100 folks on the BD2K team
  26. NIH… Turning Discovery Into Health
     philip.bourne@nih.gov