NIH’s Day, Months, Four Years of 
Data 
Philip E. Bourne Ph.D. 
Associate Director for Data Science 
National Institutes of Health
What is NIH’s Overall Approach to 
Data and What Does It Mean to You?
Some Context: NIH Data Science History 
6/12 2/14 3/14 
• Findings: 
• Sharing data & software through catalogs 
• Support methods and applications development 
• Need more training 
• Need campus-wide IT strategy 
• Hire CSIO 
• Continued support throughout the lifecycle
My Bias 
• Still a scientist 
• A funder who still thinks like a PI 
• Not yet attuned to the federal system 
• Big supporter of open science through prior work with the Public Library of Science, FORCE11 etc.
Data – A Few Observations … 
• We talk about the promise of big data, but we don’t even know the value of little data (i.e., could “Big Data” be the new “AI”?) 
• Good data is expensive in terms of time and money 
• Looking at data retroactively is really expensive 
• Good data begets trust; trust begets community; community is God 
• The way we support scientific data currently is not sustainable 
• That is, there is currently no business model for scientific data
Data – A Few NIH Observations … 
1. We have little idea how much we spend on data – estimated over $1bn per year 
2. We have even less idea how much we should be spending 
• Point 2 is part of a culture clash between the more observational history of biomedicine and the new analytical approach to discovery
Data – An Academic Medical Center Observation 
• A digital enterprise exists when data connections are made across different areas of the organization such that productivity and competitiveness are improved 
• For example, between education and research 
• I am not aware that any such academic institutions exist 
• Many are starting to wake up to the idea of getting there 
JAMIA 2014, 21(2), 194
ADDS Mission 
Statement 
To foster an ecosystem that enables 
biomedical research to be conducted 
as a digital enterprise that enhances 
health, lengthens life and reduces 
illness and disability
What Problems Are We Trying to Solve? 
One Possible Solution 
• Sustainability – 50% business model 
• Efficiency – sharing best practices in longitudinal clinical studies; “trusted investigator” 
• Collaboration – identification of collaborators at the point of data collection not publication 
• Reproducibility – data accessible with publication 
• Integration – phenotype homogenization 
• Accessibility – clinical trials registration 
• Quality – sharing CDEs across institutes 
• Training – keeping trainees in the ecosystem
The Data Ecosystem 
Community Policy 
Infrastructure 
• Sustainable 
business 
model 
• Collaboration 
• Training 
Virtuous 
Research 
Cycle
The Virtuous Cycle 
http://goo.gl/fkWjhS
Raw Materials to Seed the Ecosystem 
• NIH mandate & support 
• ADDS team of 8 people 
• Intramural participation of over 100 team members across ICs 
• Funding through BD2K: 
– ~$30M in FY14 
– ~$80M in FY15 
– ....
Organization to Seed the Ecosystem…
Associate Director for Data Science 
Scientific Data Council; External Advisory Board 
Programmatic Theme: Sustainability 
Deliverable: Commons 
Example Features: 
• Cloud – Data & Compute 
• Search 
• Security 
• Reproducibility Standards 
• App Store 
Programmatic Theme: Education 
Deliverable: Training 
Example Features: 
• Coordinate 
• Hands-on 
• Syllabus 
• MOOCs 
• Community 
Programmatic Theme: Innovation 
Deliverable: BD2K 
Example Features: 
• Centers 
• Training Grants 
• Catalogs 
• Standards 
• Analysis 
Programmatic Theme: Process 
Deliverable: Efficiency 
Example Features: 
• Data Resource Support 
• Metrics 
• Best Practices 
• Evaluation 
• Portfolio Analysis 
Programmatic Theme: Collaboration 
Deliverable: Partnerships 
Example Features: 
• ICs 
• Researchers 
• Federal Agencies 
• International Partners 
• Computer Scientists 
The Biomedical Research Digital Enterprise
Example Communities 
– NIH 
• 27 ICs 
– Agencies 
• NSF 
• DOE 
• DARPA 
• NIST 
– Government 
• OSTP 
• HHS HDI 
• ONC 
• CDC 
• FDA 
– Private sector 
• PhRMA 
• Google 
• Amazon 
– Organizations 
• PCORI, GA4GH 
• RDA, ELIXIR 
• CCC 
• CATS 
• FASEB, ISCB 
• Biophysical Society 
• Sloan Foundation 
• Moore Foundation
Example Policies 
– Clinical data harmonization 
– dbGaP in the cloud 
– Data citation 
– Machine readable data sharing plans on all grants 
– New review models, audiences etc. 
• Open review 
• Micro funding 
• Standing data committees to explore best 
practices 
• Crowd sourcing
Example Infrastructure: The Commons 
[Diagram: the Commons] 
Data: The Long Tail; Core Facilities/HS Centers; Clinical/Patient 
The Why: Data Sharing Plans 
The How: The Commons; Data Discovery Index; Software Standards Index; BD2K Centers; Cloud, Research Objects, Business Models 
The End Game: Scientific Discovery 
Other labels: NIH; Government; Awardees; Private Sector; Rest of Academia; Knowledge; Usability; Quality; Security/Privacy; Sustainable Storage; Metrics/Standards
What Does the Commons Enable? 
• Dropbox-like storage 
• The opportunity to apply quality metrics 
• Bring compute to the data 
• A place to collaborate 
• A place to discover 
http://100plus.com/wp-content/uploads/Data-Commons-3-1024x825.png
One Possible Commons Business Model 
[Adapted from George Komatsoulis] 
HPC, Institution …
Pilots Around A Virtuous Cycle 
Expect a FY15 Funding Call to Work in 
the Commons
Training & Diversity 
• Training & Diversity Goals: 
– Develop a sufficient cadre of diverse researchers skilled in the science of Big Data 
– Elevate general competencies in data usage and analysis across the biomedical research workforce 
– Combat the Google bus 
• How: 
– Traditional training grants 
– Work with ICs on a needs assessment 
– Standards for course descriptions with EU 
– Work with institutions on raising awareness 
– Partner with minority institutions 
– Virtual/physical training center(s)?
Closing Question 
Calls for increased NIH funding have so far gone unheeded. What can the ADDS do (that you have not heard about) to improve data science activities?
NIH… 
philip.bourne@nih.gov 
Turning Discovery Into Health

Yale Day of Data


Editor's Notes

  • #18 CCC: Computing Community Consortium; CATS: Committee on Applied and Theoretical Statistics; ONC: Office of the National Coordinator