The Thinking Behind Big Data at the NIH
Philip E. Bourne Ph.D.
Associate Director for Data Science
National Institutes of ...
Disclaimer: I only started March 3,
2014
…but I and others had been thinking about this
prior to my appointment
Let me start with a few examples of
what motivates our thinking …
The Story of Meredith
http://fora.tv/2012/04/20/Congress_Unplugged_
Phil_Bourne
Stephen Friend
We have Entered An Era of
Deinstitutionalize & Democratization
of Science
Daniel Hulshizer/Associated Press
We have Entered An Era of
Deinstitutionalize & Democratization
of Science – NIH Should Support This
Daniel Hulshizer/Assoc...
I can’t reproduce research from my
own laboratory?
Daniel Garijo et al. 2013 Quantifying Reproducibility in Computational ...
47/53 “landmark” publications
could not be replicated
[Begley, Ellis Nature,
483, 2012] [Carole Goble]
Reproducibility Studies Are On-going
Across the NIH
 Expected outcomes:
– Improved accessibility to data and software
– S...
You will notice that so far none of
these issues has to do with “Big
Data” per se
“Big Data” has simply bought more
attent...
What Worries Me the Most -
Sustainability
Source Michael Bell http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
We Cant Go On Like This – Some
Options
 Introduction of business models
– The 50% model
– Mergers
– Acquisitions associat...
We don’t know enough
about how current data
are used!
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
Jan. 2008 ...
Ironic Since Some Industries Thrive By
Asking These Questions
And This May Just be the Beginning
 Evidence:
– Google car
– 3D printers
– Waze
– Robotics
From: The Second Machine Age: ...
Scholarship is broken
 I have a paper with 16,000 citations that no one has
ever read
 I have papers in PLOS ONE that ha...
The reward system is in need of repair
Okay… enough of the problems
What are some solutions?
Approach to Solutions
 New policies, e.g. data sharing, blanket consent
 Funding where it is most needed
– New metrics
–...
How We Are Starting to Organize
Ourselves
Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Pr...
Solution: The Power of the Commons
Data
The Long Tail
Core Facilities/HS Centers
Clinical /Patient
The Why:
Data Sharing P...
What Does the Commons Enable?
 Dropbox like storage
 The opportunity to apply quality metrics
 Bring compute to the dat...
Commons Timeline
 Spring/Summer 2014: DS group are gathering
information about activities and needs from ICs (and
outside...
Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Pr...
TrainingTraining
 Summary of Training Workshop and Request for
Information:
– http://bd2k.nih.gov/faqs_trainingFOA.html
–...
BD2K Training RFAsBD2K Training RFAs
 K01s for Mentored Career Development Awards,
RFA-HG-14-007
 Provides salary and re...
Contemplating CSHL style training
center(s)
Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Pr...
BD2K InnovationBD2K Innovation
 Data Discovery Index Coordination
Consortium (U24) (closed)
 Metadata standards (under d...
BD2K InnovationBD2K Innovation
 BISTI PARs
– BISTI: Biomedical Information Science and Technology
Initiative
– Joint BIST...
BD2K Innovation CentersBD2K Innovation Centers
 FY14
 Investigator-initiated Centers of Excellence for Big
Data Computin...
Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Pr...
Some Thoughts About Process
 Machine readable data sharing plans?
 Open review?
 Micro funding?
 Standing data committ...
Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Pr...
Where Do We Want to End Up?
Associate Director for Data Science
Commons
Training
Center
BD2K
Modified
Review
Sustainability* Education* Innovation* Pr...
Components of The Academic Digital
Enterprise
 Consists of digital assets
– E.g. datasets, papers, software, lab notes
 ...
Life in the Academic Digital Enterprise
 Jane scores extremely well in parts of her graduate on-line neurology class.
Neu...
Some Acknowledgements
 Eric Green & Mark Guyer (NHGRI)
 Jennie Larkin (NHLBI)
 Vivien Bonazzi (NHGRI)
 Michelle Dunn (...
NIHNIH……
Turning Discovery Into HealthTurning Discovery Into Health
philip.bourne@nih.gov
Upcoming SlideShare
Loading in …5
×

The Thinking Behind Big Data at the NIH

1,550 views

Published on

Presented at the Big Data in Biomedicine Conference at Stanford University May 21, 2014

Published in: Education, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,550
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
44
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide
  • Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124
    http://www.reuters.com/article/2012/03/28/us-science-cancer-idUSBRE82R12P20120328
  • Can you put more than just one time point on this? How about making the last sub-bullet it’s own bullet (as shown)
    Yes – that is what we were thinking and wrestling with.
  • The Thinking Behind Big Data at the NIH

    1. 1. The Thinking Behind Big Data at the NIH Philip E. Bourne Ph.D. Associate Director for Data Science National Institutes of Health http://www.slideshare.net/pebourne/
    2. 2. Disclaimer: I only started March 3, 2014 …but I and others had been thinking about this prior to my appointment
    3. 3. Let me start with a few examples of what motivates our thinking …
    4. 4. The Story of Meredith http://fora.tv/2012/04/20/Congress_Unplugged_ Phil_Bourne Stephen Friend
    5. 5. We have Entered An Era of Deinstitutionalize & Democratization of Science Daniel Hulshizer/Associated Press
    6. 6. We have Entered An Era of Deinstitutionalize & Democratization of Science – NIH Should Support This Daniel Hulshizer/Associated Press
    7. 7. I can’t reproduce research from my own laboratory? Daniel Garijo et al. 2013 Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome PLOS ONE 8(11) e80278 . Can you? But what does it take and does it matter?
    8. 8. 47/53 “landmark” publications could not be replicated [Begley, Ellis Nature, 483, 2012] [Carole Goble]
    9. 9. Reproducibility Studies Are On-going Across the NIH  Expected outcomes: – Improved accessibility to data and software – Support for workflows – Closer relationships with publishers – Metrics for measuring reproducibility – Closure of the research lifecycle loop – Rewards for reproducibility
    10. 10. You will notice that so far none of these issues has to do with “Big Data” per se “Big Data” has simply bought more attention to these issues
    11. 11. What Worries Me the Most - Sustainability Source Michael Bell http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
    12. 12. We Cant Go On Like This – Some Options  Introduction of business models – The 50% model – Mergers – Acquisitions associated with best practices – Centralization – Public/private partnerships – Fee for service – Archiving  Usage metrics / impact ….
    13. 13. We don’t know enough about how current data are used! * http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm Jan. 2008 Jan. 2009 Jan. 2010Jul. 2009Jul. 2008 Jul. 2010 1RUZ: 1918 H1 Hemagglutinin Structure Summary page activity for H1N1 Influenza related structures 3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir [Andreas Prlic]
    14. 14. Ironic Since Some Industries Thrive By Asking These Questions
    15. 15. And This May Just be the Beginning  Evidence: – Google car – 3D printers – Waze – Robotics From: The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies by Erik Brynjolfsson & Andrew McAfee
    16. 16. Scholarship is broken  I have a paper with 16,000 citations that no one has ever read  I have papers in PLOS ONE that have more citations than ones in PNAS  I have data sets I am proud of few places to put them  I edited a journal but it did not count for much
    17. 17. The reward system is in need of repair
    18. 18. Okay… enough of the problems What are some solutions?
    19. 19. Approach to Solutions  New policies, e.g. data sharing, blanket consent  Funding where it is most needed – New metrics – De-identification – Agile pilots – Smaller funding for the many, but with appropriate governance – Competitions – Coordination across agencies and countries  Shared infrastructure  Support for new reward systems
    20. 20. How We Are Starting to Organize Ourselves
    21. 21. Associate Director for Data Science Commons Training Center BD2K Modified Review Sustainability* Education* Innovation* Process • Cloud – Data & Compute • Search • Security • Reproducibility Standards • App Store • Coordinate • Hands-on • Syllabus • MOOCs • Community • Centers • Training Grants • Catalogs • Standards • Analysis • Data Resource Support • Metrics • Best Practices • Evaluation • Portfolio Analysis The Biomedical Research Digital Enterprise Communication Collaboration rogrammatic Theme Deliverable Example Features • IC’s • Researchers • Federal Agencies • International Partners • Computer Scientists Scientific Data Council External Advisory Board * Hires made
    22. 22. Solution: The Power of the Commons Data The Long Tail Core Facilities/HS Centers Clinical /Patient The Why: Data Sharing Plans The Commons Government The How: Data Discovery Index Sustainable Storage Quality Scientific Discovery Usability Security/ Privacy Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment The End Game: KnowledgeNIH Awardees Private Sector Metrics/ Standards Rest of Academia Software Standards Index BD2K Centers Cloud, Research Objects,
    23. 23. What Does the Commons Enable?  Dropbox like storage  The opportunity to apply quality metrics  Bring compute to the data  A place to collaborate  A place to discover http://100plus.com/wp-content/uploads/Data-Commons-3- 1024x825.png
    24. 24. Commons Timeline  Spring/Summer 2014: DS group are gathering information about activities and needs from ICs (and outside communities). – Shared interests in developing cloud-based biomedical commons. – Investigating potential models of sustainability. – Exploring metrics of usefulness and success.  Fall 2014: Develop possible pilots to explore options in addition to those already being implemented by some ICs.
    25. 25. Associate Director for Data Science Commons Training Center BD2K Modified Review Sustainability* Education* Innovation* Process • Cloud – Data & Compute • Search • Security • Reproducibility Standards • App Store • Coordinate • Hands-on • Syllabus • MOOCs • Community • Centers • Training Grants • Catalogs • Standards • Analysis • Data Resource Support • Metrics • Best Practices • Evaluation • Portfolio Analysis The Biomedical Research Digital Enterprise Communication Collaboration rogrammatic Theme Deliverable Example Features • IC’s • Researchers • Federal Agencies • International Partners • Computer Scientists Scientific Data Council External Advisory Board * Hires made
    26. 26. TrainingTraining  Summary of Training Workshop and Request for Information: – http://bd2k.nih.gov/faqs_trainingFOA.html – Contact: Michelle Dunn (NCI)  Training Goals: – develop a sufficient cadre of researchers skilled in the science of Big Data – elevate general competencies in data usage and analysis across the biomedical research workforce.
    27. 27. BD2K Training RFAsBD2K Training RFAs  K01s for Mentored Career Development Awards, RFA-HG-14-007  Provides salary and research support for 3-5 years for intensive research career development under the guidance of an experienced mentor in biomedical Big Data Science.  R25s for Courses for Skills Development, RFA-HG- 14-008  Development of creative educational activities with a primary focus on Courses for Skills Development.  R25 for Open Educational Resources, RFA-HG-14- 009  Development of open educational resources (OER) for use by large numbers of learners at all career levels, with a primary focus on Curriculum or Methods Development.
    28. 28. Contemplating CSHL style training center(s)
    29. 29. Associate Director for Data Science Commons Training Center BD2K Modified Review Sustainability* Education* Innovation* Process • Cloud – Data & Compute • Search • Security • Reproducibility Standards • App Store • Coordinate • Hands-on • Syllabus • MOOCs • Community • Centers • Training Grants • Catalogs • Standards • Analysis • Data Resource Support • Metrics • Best Practices • Evaluation • Portfolio Analysis The Biomedical Research Digital Enterprise Communication Collaboration rogrammatic Theme Deliverable Example Features • IC’s • Researchers • Federal Agencies • International Partners • Computer Scientists Scientific Data Council External Advisory Board * Hires made
    30. 30. BD2K InnovationBD2K Innovation  Data Discovery Index Coordination Consortium (U24) (closed)  Metadata standards (under development)  Targeted Software Development Development of Software and Analysis Methods for Biomedical Big Data in Targeted Areas of High Need (U01) –RFA-HG-14-020 –Application receipt date June 20, 2014 –Topics: data compression/reduction, visualization, provenance, or wrangling. –Contact: Jennifer Couch (NCI) and Dave Miller (NCI)
    31. 31. BD2K InnovationBD2K Innovation  BISTI PARs – BISTI: Biomedical Information Science and Technology Initiative – Joint BISTI-BD2K effort – R01s and SBIRs – Contacts: Peter Lyster (NIGMS) and Jennifer Couch (NCI)  Workshops: – Software Index (Last week) • Need to be able to find and cite software, as well as data, to support reproducible science. – Cloud Computing (Summer/Fall 2014) • Biomedical big data are becoming too large to be analyzed on traditional localized computing systems. – Contact: Vivien Bonazzi (NHGRI)
    32. 32. BD2K Innovation CentersBD2K Innovation Centers  FY14  Investigator-initiated Centers of Excellence for Big Data Computing in the Biomedical Sciences (U54) RFA-HG-13-009 (closed)  BD2K-LINCS-Perturbation Data Coordination and Integration Center (DCIC) (U54) RFA-HG-14-001 (closed)
    33. 33. Associate Director for Data Science Commons Training Center BD2K Modified Review Sustainability* Education* Innovation* Process • Cloud – Data & Compute • Search • Security • Reproducibility Standards • App Store • Coordinate • Hands-on • Syllabus • MOOCs • Community • Centers • Training Grants • Catalogs • Standards • Analysis • Data Resource Support • Metrics • Best Practices • Evaluation • Portfolio Analysis The Biomedical Research Digital Enterprise Communication Collaboration rogrammatic Theme Deliverable Example Features • IC’s • Researchers • Federal Agencies • International Partners • Computer Scientists Scientific Data Council External Advisory Board * Hires made
    34. 34. Some Thoughts About Process  Machine readable data sharing plans?  Open review?  Micro funding?  Standing data committees to explore best practices?  Crowd sourcing?
    35. 35. Associate Director for Data Science Commons Training Center BD2K Modified Review Sustainability* Education* Innovation* Process • Cloud – Data & Compute • Search • Security • Reproducibility Standards • App Store • Coordinate • Hands-on • Syllabus • MOOCs • Community • Centers • Training Grants • Catalogs • Standards • Analysis • Data Resource Support • Metrics • Best Practices • Evaluation • Portfolio Analysis The Biomedical Research Digital Enterprise Communication Collaboration rogrammatic Theme Deliverable Example Features • IC’s • Researchers • Federal Agencies • International Partners • Computer Scientists Scientific Data Council External Advisory Board * Hires made
    36. 36. Where Do We Want to End Up?
    37. 37. Associate Director for Data Science Commons Training Center BD2K Modified Review Sustainability* Education* Innovation* Process • Cloud – Data & Compute • Search • Security • Reproducibility Standards • App Store • Coordinate • Hands-on • Syllabus • MOOCs • Community • Centers • Training Grants • Catalogs • Standards • Analysis • Data Resource Support • Metrics • Best Practices • Evaluation • Portfolio Analysis The Biomedical Research Digital Enterprise Communication Collaboration rogrammatic Theme Deliverable Example Features • IC’s • Researchers • Federal Agencies • International Partners • Computer Scientists Scientific Data Council External Advisory Board * Hires made
    38. 38. Components of The Academic Digital Enterprise  Consists of digital assets – E.g. datasets, papers, software, lab notes  Each asset is uniquely identified and has provenance, including access control – E.g. publishing simply involves changing the access control  Digital assets are interoperable across the enterprise
    39. 39. Life in the Academic Digital Enterprise  Jane scores extremely well in parts of her graduate on-line neurology class. Neurology professors, whose research profiles are on-line and well described, are automatically notified of Jane’s potential based on a computer analysis of her scores against the background interests of the neuroscience professors. Consequently, professor Smith interviews Jane and offers her a research rotation. During the rotation she enters details of her experiments related to understanding a widespread neurodegenerative disease in an on-line laboratory notebook kept in a shared on-line research space – an institutional resource where stakeholders provide metadata, including access rights and provenance beyond that available in a commercial offering. According to Jane’s preferences, the underlying computer system may automatically bring to Jane’s attention Jack, a graduate student in the chemistry department whose notebook reveals he is working on using bacteria for purposes of toxic waste cleanup. Why the connection? They reference the same gene a number of times in their notes, which is of interest to two very different disciplines – neurology and environmental sciences. In the analog academic health center they would never have discovered each other, but thanks to the Digital Enterprise, pooled knowledge can lead to a distinct advantage. The collaboration results in the discovery of a homologous human gene product as a putative target in treating the neurodegenerative disorder. A new chemical entity is developed and patented. Accordingly, by automatically matching details of the innovation with biotech companies worldwide that might have potential interest, a licensee is found. The licensee hires Jack to continue working on the project. Jane joins Joe’s laboratory, and he hires another student using the revenue from the license. The research continues and leads to a federal grant award. The students are employed, further research is supported and in time societal benefit arises from the technology. From What Big Data Means to Me JAMIA 2014 21:194
    40. 40. Some Acknowledgements  Eric Green & Mark Guyer (NHGRI)  Jennie Larkin (NHLBI)  Vivien Bonazzi (NHGRI)  Michelle Dunn (NCI)  Mike Huerta (NLM)  David Lipman (NLM)  Jim Ostell (NLM)  Peter Lyster (NIGMS)  All the over 100 folks on the BD2K team
    41. 41. NIHNIH…… Turning Discovery Into HealthTurning Discovery Into Health philip.bourne@nih.gov

    ×