George A. Komatsoulis, Ph.D.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
U.S. Department of Health and Human Services
The Commons
Digital Objects (with identifiers)
Search (Indexed Metadata and API)
Computing Platform
Open APIs
Software Encapsulation
Commons Federation (Infrastructure)
BD2K Centers
DDICC (Search)
Existing Resources
Indexes Methods Content
Investigator
Works In
Searches
Commons Federation (Infrastructure)
Conformant Provider A
Conformant Provider B
Conformant Provider C
The Commons: Business Model
Diagram entities: Researcher; Discovery Index; The Commons; Cloud Provider A; Cloud Provider B; Cloud Provider C; NIH
Diagram labels: Provides Digital Objects; Retrieves/Uses Digital Objects; Option: Fund Providers to Support NIH Directed Resources; Indexes Commons; Provide Credits; Uses Credits; Finds Objects
• Commons implemented as a federation of ‘conformant’ cloud providers and HPC environments
• Funded primarily by providing credits to investigators
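The flow in the diagram (providers hold digital objects, the Discovery Index indexes the Commons, the researcher finds objects) can be sketched roughly as below. This is only an illustration under stated assumptions: the class names, the identifiers such as obj-001, and the metadata fields are all hypothetical and do not come from any Commons specification.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A digital object held by one provider (all fields are hypothetical)."""
    identifier: str
    provider: str
    metadata: dict = field(default_factory=dict)


class DiscoveryIndex:
    """Toy in-memory index spanning objects hosted by several providers."""

    def __init__(self):
        self._entries = {}

    def index(self, obj):
        # Index entries are keyed by the object's identifier.
        self._entries[obj.identifier] = obj

    def find(self, keyword):
        # Case-insensitive substring match against every metadata value.
        kw = keyword.lower()
        hits = (
            obj for obj in self._entries.values()
            if any(kw in str(value).lower() for value in obj.metadata.values())
        )
        return sorted(hits, key=lambda obj: obj.identifier)


def build_demo_index():
    """Populate the index with made-up objects from three providers."""
    idx = DiscoveryIndex()
    idx.index(DigitalObject("obj-001", "Cloud Provider A", {"title": "Exome variant calls"}))
    idx.index(DigitalObject("obj-002", "Cloud Provider B", {"title": "RNA-seq alignment pipeline"}))
    idx.index(DigitalObject("obj-003", "Cloud Provider C", {"title": "Exome coverage report"}))
    return idx
```

A search such as build_demo_index().find("exome") returns the matching objects regardless of which provider hosts them, which is the point of a federation-wide index.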
• Cost effective: only pay for IT support used
• Drives competition: better services at lower cost
• Supports data sharing by driving science into the Commons
• Facilitates public-private partnership
• Scalable to most categories of data expected in the next 5 years
• Novelty:
  • Never been tried, so we don’t have data about likelihood of success
• Cost Models:
  • Predicated on stable or declining prices among providers
  • True for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in industry
• Service Providers:
  • Predicated on service providers being willing to make the investment to become conformant
  • Market research suggests 3-5 providers within 2-3 months of program launch
• Persistence:
  • The model is ‘Pay As You Go’, which means if you stop paying it stops going
  • Gives investigators an unprecedented level of control over what lives (or dies) in the Commons
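To make the ‘Pay As You Go’ persistence point concrete, here is a minimal sketch of how long a credit balance keeps an object alive. The function name, the flat monthly pricing model, and every number in the usage note are assumptions for illustration, not actual Commons or provider pricing.

```python
def months_of_persistence(credit_balance, storage_tb, price_per_tb_month,
                          compute_per_month=0.0):
    """Return the whole months a credit balance covers at a flat monthly cost.

    Assumes a simple model: monthly cost = storage_tb * price_per_tb_month
    plus a fixed compute charge. Real provider pricing is more complicated.
    """
    values = (credit_balance, storage_tb, price_per_tb_month, compute_per_month)
    if any(v < 0 for v in values):
        raise ValueError("all inputs must be non-negative")
    monthly_cost = storage_tb * price_per_tb_month + compute_per_month
    if monthly_cost == 0:
        # Nothing is being billed, so the credits never run out.
        return float("inf")
    return int(credit_balance // monthly_cost)
```

With hypothetical figures of 12,000 credits, 10 TB stored at 25 credits per TB per month, and 250 credits of compute per month, the monthly cost is 500 credits and the call months_of_persistence(12000, 10, 25, 250) gives 24 months. After that, unless more credits are applied, the object stops being hosted.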
• Minimum set of requirements for:
  • Business relationships (reseller, investigators)
  • Interfaces (upload, download, manage, compute)
  • Capacity (storage, compute)
  • Networking and Connectivity
  • Information Assurance
  • Authentication and authorization
• Likely to be reviewed self-certification in pilot phase
• A conformant cloud ≠ an IaaS provider
• Likely to evolve into multiple ‘Levels of Compliance’ corresponding to increasing degrees of making data/software meet ‘FAIR’ criteria
• Some of our current thinking for basic compliance:
  • Objects are physically or logically available in the Commons
  • Objects are indexed with a usable identifier
  • Objects have basic search metadata attached to index entries
  • Objects have clear access rules
  • Objects have basic semantic metadata available
• Higher levels could include:
  • Objects indexed with standards-based identifiers (ORCID, DOI, etc.)
  • Objects are open to the public (or as open as reasonable given data type)
  • Objects conform to agreed-upon standards (CDISC, DICOM, etc.)
  • Data objects are accessible via standard APIs
  • Software is encapsulated (containers, other technology) for easier usage
• We want and need your feedback on these matters!
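The compliance criteria listed above could be checked mechanically along the following lines. This is a sketch only: the criteria keys are hypothetical names chosen to mirror the bullets, and since the slides do not define how higher levels are numbered, the function reports which higher-level criteria are met rather than assigning a level.

```python
# Hypothetical keys mirroring the "basic compliance" bullets.
BASIC_CRITERIA = (
    "available_in_commons",
    "has_usable_identifier",
    "has_search_metadata",
    "has_access_rules",
    "has_semantic_metadata",
)

# Hypothetical keys mirroring the "higher levels could include" bullets.
HIGHER_CRITERIA = (
    "standards_based_identifier",
    "open_to_public",
    "conforms_to_standards",
    "standard_api",
    "software_encapsulated",
)


def compliance_report(obj):
    """Summarize which criteria a digital object (a dict of booleans) meets."""
    missing_basic = [c for c in BASIC_CRITERIA if not obj.get(c, False)]
    higher_met = [c for c in HIGHER_CRITERIA if obj.get(c, False)]
    return {
        "basic_compliant": not missing_basic,
        "missing_basic": missing_basic,
        "higher_met": higher_met,
    }
```

An object that satisfies all five basic criteria and also exposes a standard API would be reported as basic-compliant with "standard_api" listed under the higher criteria it meets. An object lacking clear access rules would fail basic compliance, with that criterion named in the report.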
• Phase 0: Build the plumbing
• Phase 1: Pilot the model on a small number of investigators experienced with cloud computing, probably within the context of BD2K awards
• Phase 2: Open the Commons credit process to grantees from a subset of NIH Institutes and Centers
• Phase 3: Open the process to all NIH grantees
• Approved March 23, 2015
• “In light of the advances made in security protocols for cloud computing in the past several years and given the expansion in the volume and complexity of genomic data generated by the research community, the National Institutes of Health (NIH) is now allowing investigators to request permission to transfer controlled-access genomic and associated phenotypic data obtained from NIH-designated data repositories under the auspices of the NIH Genomic Data Sharing (GDS) Policy to public or private cloud systems for data storage and analysis.”
• Responsibility for ensuring the security and integrity remains with the institution.
[Timeline figure; axis spans 1960 to 2020]
Sensor Stream = 500 EB/day
Stores 69 TB/day
Collection = 14 EB/day
Store 1 PB/day
Total Data = 14 PB
Store an average of 3.3 TB/day for 10 years!
• NIH Office of ADDS
  • Vivien Bonazzi, Ph.D.
  • Philip Bourne, Ph.D.
  • Michelle Dunn, Ph.D.
  • Mark Guyer, Ph.D.
  • Jennie Larkin, Ph.D.
  • Leigh Finnegan
  • Beth Russell
• NCBI
  • Dennis Benson, Ph.D.
  • Alan Graeff
  • David Lipman, MD
  • Jim Ostell, Ph.D.
  • Don Preuss
  • Steve Sherry

Komatsoulis internet2 executive track


Editor's Notes

  • #11 Minimum Requirements:
    Business relationship: Allows distribution and billing of credits and ensures that liability issues are resolved. The investigator who puts a digital object in the Commons retains the liability associated with its use.
    Interfaces: Would need to be open, but not necessarily open-source. Requires support for basic operations. In addition, the environment has to be open to all, so a private environment behind a university firewall won’t work.
    Identifiers and metadata: Tied together; together they enable researchers to search for and find resources.
    Networking and Connectivity: Make sure that stuff is accessible. Require connection to the commodity Internet and Internet2, but the key element from the investigator's point of view is a free egress tier for academics.
    Information Assurance: Environment is secure.
    A&A: Must support InCommon because most NIH investigators have it. Minimizes the hassle of granting access to collaborators across multiple platforms.
    Approval of clouds: Self-certify vs. NIH certify vs. 3rd-party certify. In early test cases, may simply say ‘FedRamped’.
    Cloud vs. IaaS: Some IaaS providers (AWS comes to mind) may be uninterested in providing the ‘conformant’ layer but support other companies that provide these services using an AWS backend. There are already exemplars of this: Seven Bridges Genomics and the Cancer Genomics Cloud Pilots are all software layers over an IaaS provider.
  • #18
    1965: Generation capacity < 100 amino acids/year/person => Dayhoff creates 1 base code to simplify computing in the punch card era.
    1977: Sanger and Maxam-Gilbert sequencing invented. By the mid-1980s, an increase in production of 2 orders of magnitude (maybe 10-20K bases total, 2-3K finished/year).
    1986: Development of dye-based sequencing, ABI 370A; 2000 bases/day/instrument by the mid-1990s.
    1996: Development of DNA microarrays. 2-dye 100K chips => 200K/chip/day.
    2000s: Next-gen sequencing; 100Ms/day.