Opportunities and Challenges for International Cooperation Around Big Data
1. Opportunities and Challenges for
International Collaboration Around
Big Data
Philip E. Bourne, PhD
Associate Director for Data Science
National Institutes of Health
philip.bourne@nih.gov
November 12, 2014
3. Top Down
Protein sequence and functional annotation
Cellular models
Pathway and reaction annotation
Protein interaction annotation
Evidence-based proteomics annotation
Gene Ontology annotation
Variants Annotation
ClinVar / OMIM
MedThesaurus
[adapted from Ioannis Xenarios]
5. The NIH Data
Science Mission
Statement
To foster an ecosystem that enables
biomedical* research to be
conducted as a digital enterprise that
enhances health, lengthens life and
reduces illness and disability
* Includes biological, biomedical, behavioral, social,
environmental, and clinical studies that relate to understanding
health and disease.
7. Elements of The Ecosystem
Community Policy
Infrastructure
• Sustainability
• Collaboration
• Training
8. Elements of The Ecosystem
Community Policy
Infrastructure
• Sustainability
• Collaboration
• Training
Virtuous
Research
Cycle
10. Policies – Now & Forthcoming
Data Sharing
– Genomic data sharing announced
– Data sharing plans on all research awards
– Data sharing plan enforcement
• Machine readable plan
• Repository requirements to include grant numbers
http://www.nih.gov/news/health/aug2014/od-27.htm
11. Policies - Forthcoming
Data Citation
– Goal: legitimize data as a form of scholarship
– Process:
• Machine readable standard for data citation (done)
• Endorsement of data citation for inclusion in NIH
biosketch, grants, reports, etc.
• Example formats for human readable data citations
• Slowly work into NLM/NCBI workflow
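To make the distinction between a machine readable and a human readable data citation concrete, here is a minimal sketch. The field set (creators, year, title, repository, identifier), the `DataCitation` class, and the rendered layout are all assumptions chosen for illustration; they are not the NIH or NLM/NCBI format, and the identifier shown is a placeholder.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DataCitation:
    """Hypothetical citation record; the field names are illustrative only."""
    creators: list
    year: int
    title: str
    repository: str
    identifier: str  # a persistent identifier; placeholder value below

    def to_machine_readable(self) -> str:
        # Serialize every field to JSON so software can parse the citation.
        return json.dumps(asdict(self), sort_keys=True)

    def to_human_readable(self) -> str:
        # Render one possible reading format; the layout is an assumption.
        authors = "; ".join(self.creators)
        return (f"{authors} ({self.year}). {self.title}. "
                f"{self.repository}. {self.identifier}")


if __name__ == "__main__":
    citation = DataCitation(
        creators=["Doe J", "Roe R"],
        year=2014,
        title="Example dataset",
        repository="Example Repository",
        identifier="doi:10.0000/example",  # placeholder, not a real DOI
    )
    print(citation.to_machine_readable())
    print(citation.to_human_readable())
```

Both forms come from the same record, so a citation listed in a biosketch or report could in principle be traced back to the same identifier that software resolves.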
12. Infrastructure - The Commons
[Diagram labels: BD2K Centers (six shown), DDICC, Software, Standards, Labs (four shown)]
13. What is the Commons?
A Conceptual Framework for:
Sharing, finding, integrating, reusing and
attributing digital research objects
– “Each digital object has a UID that must allow it to
be found, shared and attributed” – The Commons
Document
The Commons is agnostic of computing platform
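The idea that every digital object carries a UID allowing it to be found, shared and attributed can be sketched as a small in-memory registry. This is only an illustration under stated assumptions: the `Commons` and `DigitalObject` classes, the `commons:` UID prefix, and the token-based index are hypothetical and do not describe any actual Commons implementation.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical digital research object (dataset, software, workflow...)."""
    uid: str
    kind: str
    creators: tuple
    metadata: dict = field(default_factory=dict)


class Commons:
    """Toy registry: UID assignment, metadata indexing, search, attribution."""

    def __init__(self):
        self._objects = {}              # uid -> DigitalObject
        self._index = defaultdict(set)  # lowercase token -> set of uids

    def register(self, kind, creators, metadata):
        # Assign a unique identifier; the "commons:" prefix is an assumption.
        uid = f"commons:{uuid.uuid4()}"
        self._objects[uid] = DigitalObject(uid, kind, tuple(creators), dict(metadata))
        # Index every whitespace-separated token of every metadata value.
        for value in metadata.values():
            for token in str(value).lower().split():
                self._index[token].add(uid)
        return uid

    def resolve(self, uid):
        # "Shared": anyone holding the UID can retrieve the object.
        return self._objects[uid]

    def search(self, query):
        # "Found": return UIDs whose metadata contains all query tokens.
        tokens = query.lower().split()
        if not tokens:
            return []
        hits = set.intersection(*(self._index.get(t, set()) for t in tokens))
        return sorted(hits)

    def attribute(self, uid):
        # "Attributed": produce a credit line tied to the UID.
        obj = self.resolve(uid)
        title = obj.metadata.get("title", "Untitled")
        return f"{', '.join(obj.creators)}. {title}. {uid}"


if __name__ == "__main__":
    commons = Commons()
    uid = commons.register(
        "dataset", ["A. Researcher"],
        {"title": "Example variant calls", "format": "VCF"},
    )
    print(commons.search("variant calls"))
    print(commons.attribute(uid))
```

Nothing in the sketch depends on where the objects are stored or computed on, which mirrors the statement that the Commons is agnostic of computing platform.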
14. The Commons:
Framework Implementation
Digital Objects
(with UIDs)
Search
(indexed metadata)
Computing
Platform
The Commons
16. The Commons:
Framework Draft Implementations
The Commons
Conceptual Framework
Public Cloud Platforms
– Google, AWS (Amazon), Microsoft (Azure), IBM, other?
– Most easily accessed by NIH PIs
Super Computing (HPC) Platforms
– Super Computing 2014: ADDS coordinating meeting with SC centers
– NERSC “Commons Pilot”
Other Platforms?
– In house compute solutions
– Private clouds, HPC
• Pharma
• The Broad
• Bionimbus
– Low access by NIH PIs
18. The Commons:
Framework Draft Implementation
The Commons
Conceptual Framework
Digital Objects to populate and test the Commons:
– BD2K centers, NCI Cloud pilots (Google & AWS supported)
– Large Public Data Sets, MODs
Search
– BD2K Data and Software Discovery Indices
– Google Search functions
Use cases
Public Cloud
Platforms
19. The Commons:
Framework Draft Implementation
The Commons
Conceptual Framework
Next Steps
– Determine which BD2K centers are most appropriate for a
cloud Commons pilot
– Develop a plan of action with NCI Cloud pilots
– Working with DDICC/SW Discovery Indices (UIDs, Search)
– Working with Google and AWS (Amazon) to determine what
is needed computationally
• In kind support (short term pilot)
• Conformant clouds (long term sustainable model)
– Developing Use cases!
Public Cloud
Platforms
20. A Business Model for
The Commons
The Commons:
Framework Draft Implementation
22. Community: BD2K Awards
Governance
November 3 Kick-off PI Meeting
– Emphasis on working groups that span centers and begin
the work of building the ecosystem
• Common API development (with GA4GH)
• Mobile
• Metadata
• Grand challenges
– Emphasize sharing from day 1
– Incentivized to work in the Commons
23. Community Short Term Interactions
NSF Workshops and Dear Colleague letter
Workshop with NOAA on public-private
partnerships
ELIXIR Workshop
– Standards
– Training
Workshop Inspiring the Game Developer
Community to Engage in and Enhance
Biomedical Research, Dec 2014
24. Community: Training
Data Science Training Goals
1) Build a digital framework for data science
training:
NIH Data Science Workforce Development Center
2) Develop short-term training opportunities:
Courses, educational resources, etc.
3) Develop the discipline of biomedical data
science and support cross-training
Goals expanded from recommendations in the June 2012 DIWG and Aug 2013
Training workshop reports.
25. Heads Up on What is Coming in FY15
Calls for using the Commons
Calls for a standards framework development
Calls for software development
Calls to stimulate interactions between communities
(diversity, rotations, library)
Calls for high risk, high return projects
Your ideas here…..
Swiss-Prot annotation efforts are structured in such a way as to cover various community needs. We move from the basic curation of protein sequences and their individual functions – in individual records – to the representation of higher order assemblies of proteins in complexes and networks, and functional pathways. To do this we maintain curation efforts targeting reactions and pathways, GO functions, protein interactions, proteomics annotations…
This means we have a reservoir of prior experience and expertise in this domain.
We actively participate in development of standards and protocols for annotation in the context of numerous consortia.
We have access to the ChEBI and Rhea submission tools. We can create universal, stable identifiers for new lipid species (and any other small molecules).