Data Science &
BD2K Update
Philip Bourne, PhD, FACMI
Associate Director for Data Science
Advisory Committee to the NIH Director
June 10, 2016
http://datascience.nih.gov
Slides: http://www.slideshare.net/pebourne
Data Science Agenda
• What problems are we trying to solve?
• What are the solutions we are exploring?
• How does BD2K facilitate those solutions?
What Problems Are We
Trying to Solve?
• Data are extensive, complex and growing
• Data are in silos while science transcends
those silos
• Data are expensive to maintain and share
while demands for sharing are increasing
• There is an insufficient workforce with the
needed data analytical skills
• A collective (trans NIH) solution
Quantifying the Problem
• Big Data
– Total data from NIH-funded research currently
estimated at 650 PB*
– 20 PB of that is in NCBI/NLM (3%) and it is expected
to grow by 10 PB this year
• Dark Data
– Only 12% of data described in published papers is in
recognized archives – 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on
maintaining data archives
* In 2012 Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
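The proportions quoted on this slide can be checked with a few lines of arithmetic. This is only a sanity-check sketch; the variable names are illustrative and the inputs are the slide's own estimates.

```python
# Figures quoted on the slide (petabytes unless noted).
total_nih_pb = 650           # estimated total data from NIH-funded research
ncbi_nlm_pb = 20             # portion held at NCBI/NLM
ncbi_growth_pb = 10          # expected NCBI/NLM growth this year
library_of_congress_pb = 3   # Library of Congress in 2012 (slide footnote)
archived_fraction = 0.12     # share of published data in recognized archives

ncbi_share = ncbi_nlm_pb / total_nih_pb                   # about 3% of the total
loc_multiples = total_nih_pb / library_of_congress_pb     # size relative to the 2012 Library of Congress
dark_fraction = 1 - archived_fraction                     # the "dark data" share
ncbi_growth_rate = ncbi_growth_pb / ncbi_nlm_pb           # one-year growth relative to current holdings

print(f"NCBI/NLM share of total: {ncbi_share:.1%}")
print(f"Total vs. Library of Congress (2012): {loc_multiples:.0f}x")
print(f"Dark data share: {dark_fraction:.0%}")
print(f"Expected NCBI/NLM growth this year: {ncbi_growth_rate:.0%}")
```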
*Note: Award data confirmed as of 03/2016. Some repositories are funded by hybrid mechanisms (e.g., grants-contracts, IAA-contracts).
Biomedical Digital Data
Repository Survey by Institute
and Center (IC)
• Leadership meeting late in 2015 requested a
survey of IC approaches and plans for data
repositories
• Responses received from 18 ICs
• Clear challenges were identified
The Major Challenge
Encountered When Considering
Repository Funding
• Cost (6 ICs)
• Sustainability (6 ICs)
• Redundancy (4 ICs)
• Lack of Expertise (Within and Outside NIH) (3 ICs)
• Utility (3 ICs)
• Lack of Trans-NIH Guidance or Best Practices (2 ICs)
* Some ICs identified multiple challenges equally
What Solutions Are We
Exploring?
The Commons is one solution that
leverages the experiences in
cloud-based computing and is
being enabled by BD2K research
Examples of Cloud Based
Initiatives
5 PB
40TB AWS
The Commons – The Internet of
Data
• Findable
• Accessible
• Interoperable
• Reusable
The Commons offers a path forward to integrate
these discrete cloud-based initiatives using BD2K
developments to make data FAIR*
The internet started as discrete networks that
merged - the same could happen with data
* http://www.ncbi.nlm.nih.gov/pubmed/26978244
Use Case:
Aggregate integrated data offers
the potential for new insights into
rare diseases …
As we get more precise every disease becomes a rare disease
Diffuse Intrinsic Pontine
Gliomas (DIPG): In need of a
new data-driven approach
• Occur in 1 in 100,000
individuals
• Peak incidence 6-8 years
of age
• Median survival 9-12
months
• Surgery is not an option
• Chemotherapy ineffective
and radiotherapy only
transient
From Adam Resnick
Timeline of Genomic Studies
in DIPG
• Landmark studies
identify histone mutations
as recurrent driver
mutations in DIPG ~2012
• Almost 3 years later, in
largely the same
datasets, but partially
expanded, the same two
groups and 2 others
identify ACVR1
mutations as a
secondary, co-occurring
mutation
From Adam Resnick
Hypothesis: The Commons
would have revealed ACVR1
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor
progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1 mutations
• If only large-scale data sets had been
integrated with TCGA and/or rare
disease data in 2012, ACVR1 mutations
would have been identified
• 60 patients/year × 3 years = 180
children’s lives (who likely succumbed to
the disease during that time) could have
been impacted if only data were FAIR
From Adam Resnick
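The estimate on this slide is simple arithmetic, sketched below with the slide's own approximate figures. The variable names are illustrative, and the ~20% share is only what the slide's two numbers imply, not a separately sourced statistic.

```python
# Approximate figures taken from the slide.
dipg_patients_per_year = 300    # ~300 DIPG patients a year
acvr1_patients_per_year = 60    # ~60 predicted to carry ACVR1 mutations
years_between_findings = 3      # gap between the histone and ACVR1 findings

# Share of DIPG patients implied by the two slide figures.
acvr1_fraction = acvr1_patients_per_year / dipg_patients_per_year

# Patients who might have been affected over the gap, per the slide's reasoning.
children_potentially_impacted = acvr1_patients_per_year * years_between_findings

print(f"Implied ACVR1 share of DIPG patients: {acvr1_fraction:.0%}")
print(f"Children potentially impacted over {years_between_findings} years: "
      f"{children_potentially_impacted}")
```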
A 3 Year BD2K Sponsored
Commons Pilot is Under Way
– Questions to be addressed:
• Does the ability to compute across very large
datasets lead to new discoveries?
• Are data and analytics more easily located and
shared and does this improve productivity?
• Is there an advantage to having the results of those
large calculations also available?
• Is research more reproducible?
• Is this environment more cost-effective than what
we do now?
Another use case…
Let’s review the Commons pilot
using the Model Organism
Databases (MODs) as an
example …
Example of the Problem:
The Model Organism Databases (MODs)
• Highly curated and
valuable data
• Siloed / Not
interoperable
• Cumbersome to
compute over all the
data
• Costly to maintain as
individual resources
NHGRI & NHLBI
Step 1: Data & Analytics
Moved to the Commons
• Moved as Commons compliant shared
research objects, including:
– Identifiers
– Minimal metadata standards
[Diagram labels: Shared Research Objects; NCBI Intramural; Hybrid; Extramural; NCBI/NLM Existing Coordinating Centers CIT ICs; CF; MOD Data]
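One way to picture a "Commons compliant shared research object" is as a record that carries an identifier plus a minimal set of metadata. The sketch below is purely illustrative: the class name, the required keys, and the identifier format are all assumptions, since the slide does not specify the actual conformance requirements.

```python
from dataclasses import dataclass, field

# Hypothetical minimal metadata keys; the slide does not list the real ones.
REQUIRED_METADATA = ("title", "creator", "license")


@dataclass
class SharedResearchObject:
    identifier: str   # persistent identifier (the format used here is made up)
    kind: str         # "data" or "analytics"
    metadata: dict = field(default_factory=dict)


def missing_metadata(obj: SharedResearchObject) -> list:
    """Return the required metadata keys that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not obj.metadata.get(key)]


def is_commons_compliant(obj: SharedResearchObject) -> bool:
    """An object passes if it has an identifier and all required metadata."""
    return bool(obj.identifier.strip()) and not missing_metadata(obj)


complete = SharedResearchObject(
    identifier="example:mod-0001",
    kind="data",
    metadata={"title": "Curated gene annotations",
              "creator": "Example MOD",
              "license": "CC0"},
)
incomplete = SharedResearchObject(
    identifier="example:mod-0002",
    kind="data",
    metadata={"title": "Untitled"},
)

print(is_commons_compliant(complete))
print(missing_metadata(incomplete))
```

A check of this kind could run when data are moved into the Commons, so that objects lacking an identifier or basic metadata are flagged before they are shared.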
BD2K is Providing Those
Metadata Standards
Step 2: Layers of Software &
Services Added
• Services: APIs, Containers, Indexing
• Software: Services & Tools
• Scientific analysis tools/workflows
• App store/User Interface
[Diagram labels: Shared Research Objects; NCBI Intramural; Hybrid; Extramural; NCBI/NLM Existing Coordinating Centers CIT ICs]
DataMed is a “Find” Service
Developed by BD2K
MOD Data indexed
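A "Find" service over indexed metadata can be pictured as a keyword index that maps terms to object identifiers. The sketch below is a toy inverted index with made-up identifiers and descriptions; it is not DataMed's actual design or API, which the slide does not describe.

```python
from collections import defaultdict


def build_index(records):
    """Map each lowercase description token to the identifiers that contain it."""
    index = defaultdict(set)
    for identifier, description in records.items():
        for token in description.lower().split():
            index[token].add(identifier)
    return index


def find(index, query):
    """Return identifiers whose descriptions contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)


# Hypothetical records standing in for indexed MOD data.
records = {
    "example:mod-0001": "zebrafish gene expression",
    "example:mod-0002": "mouse gene phenotype",
    "example:mod-0003": "yeast protein interaction",
}
index = build_index(records)

print(sorted(find(index, "gene")))
print(sorted(find(index, "mouse gene")))
print(sorted(find(index, "human")))
```

The point of the sketch is only that once metadata from separate resources share one index, a single query can reach across all of them.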
Step 3: Commons Content Shared
While Maintaining Autonomous
Views
• Services: APIs, Containers, Indexing
• Software: Services & Tools
• Scientific analysis tools/workflows
• App store/User Interface
[Diagram labels: IC, IC, IC and CF views; Shared Research Objects; NCBI Intramural; Hybrid; Extramural; NCBI/NLM Existing Coordinating Centers CIT ICs; MOD Data View]
BD2K Commons Pilot
Timeline
Project Year 1: FY 2015, Oct 2015 – Sep 2016
Project Year 2: FY 2016, Oct 2016 – Sep 2017
Project Year 3: FY 2017, Oct 2017 – Sep 2018
Step 0: Initiation
• Finalize conformance requirements
• Arrange initial providers
Step 1: ~5 Initial Projects including Common Fund (We Are Here)
Step 2: ~50 projects
Step 3: Evaluation & Next Steps
The Major Challenge
Encountered When Considering
Repository Funding
• Cost (6 ICs)
• Sustainability (6 ICs)
• Redundancy (4 ICs)
• Lack of Expertise (Within and Outside NIH) (3 ICs)
• Utility (3 ICs)
• Lack of Trans-NIH Guidance or Best Practices (2 ICs)
* Some ICs identified multiple challenges equally
16 T32/T15 Predoctoral Training Programs
21 Postdoctoral and Faculty Career Awards
Enhancing Diversity
• Focus on low-resourced institutions
– Supports curriculum and faculty development
– Supports research experiences for
undergraduates
• Builds partnerships with BD2K Centers
Improving Data Science
Skills Among all Biomedical
Scientists
24 awards
1 award
The Role of BD2K
1. Commons
– Resource Indexing
– Standards
– Cloud & HPC
– Sustainability
2. Data Science Research
– Centers
– Software Analysis & Methods
3. Training & Workforce Development
NIH…
Turning Discovery Into Health
philip.bourne@nih.gov
https://datascience.nih.gov/
Data Science Events at NIH
• Pi Day
• 2016 Lecture by Carlos Bustamante
• Poster Session with Pies
• PiCo Lightning Talks
• Pi Day Scholars: outreach to high schools
• Workshop: Reproducible Research
• Lecture Series: Distinguished and Frontiers in Data Science
• Data Science Courses
• Machine learning
• Hackathons


Editor's Notes

  • #4 Holdren memo GDS policy
  • #5 $1.25bn per year to capture all data. Even after a significant effort at reduction, intramural data is spread across more than 60 data centers; imagine the extramural situation.
  • #6 30-40 of the 777 obligated awards were co-funded rather than funded solely by one IC. Supplemental slides provide a further breakdown. Say that it ‘cannot be business as usual and that we must transition to a new model’.
  • #8 Utility – see as a value proposition.
  • #10 Goal is to be able to easily draw upon data across these initiatives. Internet of data will see these individual initiatives merge.
  • #16 Other federal agencies are adopting this model and the EU is investing 6 bn Euros based on this model.
  • #19 We are currently in the pilot phase with small amounts of data/analytics being migrated.
  • #20 CEDAR is developing metadata templates. SCC coordinates standards to address such questions as “Is there a standard for data type x?” Working groups are spontaneous efforts between the centers to develop and share developments, including standards.
  • #23 The Commons enables data sharing across ICs but also enables them to have an autonomous view on the data and services they provide.
  • #25 Utility – see as a value proposition.
  • #26 The biomedical data science pipeline is centered around the T32/T15 Predoctoral Training Programs. We are funding 16, each with 6 trainees, across the country. Over a quarter of the training budget is going to this program. Postdocs and beyond specialize in biomedical data science through mentored protected time for career development.
  • #28 Educational resource discovery index uses data science to support learning about data science. It automatically discovers training resources using information extraction (this is an international collaboration), organizes training resources through manual curation and resource modeling, and personalizes recommendations through predictive modeling (future work).