George A. Komatsoulis, Ph.D.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
U.S. Department of Health and Human Services
[Timeline: growth of sequence data generation, 1960–2020]
• Sensor stream = 500 EB/day; stores 69 TB/day
• Collection = 14 EB/day; store 1 PB/day
• Total data = 14 PB; store an average of 3.3 TB/day for 10 years!
• A wide range of data categories
• Multiple technology platforms used to generate the same class of data
• Privacy and patient protection
• Together = greater complexity
DNA Sequence alone could total 250 PB
Source: AAAS (http://www.aaas.org/fy16budget/national-institutes-health)
[Diagram: today, each university combines public data repositories with local data, locally developed and publicly available software, and local storage and compute resources]
• Lots of data =
  – Difficult to move from place to place
  – Expensive to store multiple copies
  – Expensive to provide computing capacity
  – Even more potentially on the way
  – Expanding “data inequality” among researchers
• Budgets are stagnant =
  – More resources going to managing data
  – Fewer resources available for science
• Current methods of providing IT resources are not efficient
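The transfer problem is easy to quantify. A back-of-the-envelope sketch, using the figures from the speaker notes (the ~2.5 PB TCGA data set over a 10 Gb/s university link):

```python
# Back-of-the-envelope transfer time for a large data set.
# Figures from the speaker notes: TCGA is ~2.5 PB and a fast
# university link runs at ~10 Gb/s.
PETABYTE = 10**15                 # bytes (decimal petabyte)
data_bytes = 2.5 * PETABYTE
link_bits_per_sec = 10 * 10**9    # 10 Gb/s

seconds = data_bytes * 8 / link_bits_per_sec
days = seconds / 86_400
print(f"{days:.1f} days")         # ~23 days at full, sustained line rate
```

Even at a sustained 10 Gb/s, moving a single copy takes weeks, which is why co-locating data and compute matters.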
The Commons:
• Is scalable and exploits new computing models
• Is more cost-effective given digital growth
• Simplifies sharing of digital research objects such as data, software, metadata and workflows
• Makes digital research objects more FAIR: Findable, Accessible, Interoperable and Reusable
• DOES NOT replace existing, well-curated databases
– Phil Bourne, 2014
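As a concrete illustration of the FAIR principles (the field names below are hypothetical, not a Commons specification), a digital research object might be indexed with a record like this:

```python
# Hypothetical sketch of a FAIR-style index record for a digital research
# object; none of these field names come from an actual Commons spec.
digital_object = {
    "id": "commons:obj-0001",                # persistent identifier -> Findable
    "access_url": "https://example.org/objects/obj-0001",  # -> Accessible
    "type": "dataset",
    "metadata_schema": "https://schema.org/Dataset",       # -> Interoperable
    "license": "CC-BY-4.0",                                # -> Reusable
    "derived_from": ["doi:10.0000/example"],               # provenance
}
```

The point is not the exact schema but that identifier, access route, typed metadata, license and provenance must all travel with the object.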
The Commons stack (Vivien Bonazzi, 2015):
• Software: services and tools
  – Services: APIs, IDs, encapsulation
  – Analysis tools/workflows
  – UI/app store
• Data
  – User data sets
  – Reference data sets
• Infrastructure
  – Storage and compute
[Diagram: NCI Genomic Data Commons. A Data Coordinating Center performs QA/QC, validation and aggregation to build an authoritative NCI reference data set; NCI Clouds (Cloud Pilots ×3) and high-performance computing support search/retrieve, download and analysis]
• Commons Supplements – data, analysis tools, APIs, containers
  – BD2K Centers
  – MODs (Model Organism Databases)
  – Interoperability (some)
  – HMP (Human Microbiome Project)
• The Cloud Credits Model
  – Accessing cloud services for NIH grantees
• NIH-affiliated Commons projects
  – NIAID/CF/ADDS – Microbiome/HMP Cloud Pilot
  – NCI Cloud Pilots & Genomic Data Commons
[Diagram: researchers develop capabilities, index resources and make them available in the Commons (infrastructure) alongside well-curated databases; an investigator searches, downloads, computes on and uses those resources]
[Diagram: the Commons (infrastructure) spans Cloud Providers A, B and C. NIH provides credits to the investigator (option: direct funding) and enables search via a discovery index that indexes the Commons; the investigator uses the credits in the Commons]
• Supports simplified data sharing by driving science into publicly accessible computing environments that still provide investigator-level access control
• Scalable to the needs of the scientific community for the next 5 years
• Democratizes access to data and computational tools
• Cost-effective
  – Competitive marketplace for biomedical computing services
  – Reduces redundancy
  – Uses resources efficiently
• Novelty:
  – Never been tried, so we don’t have data about the likelihood of success
• Cost models:
  – Predicated on stable or declining prices among providers
  – True for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in the industry
• Service providers:
  – Predicated on service providers willing to make the investment to become conformant
  – Market research suggests 3–5 providers within 2–3 months of program launch
• Persistence:
  – The model is ‘pay as you go’, which means if you stop paying, it stops going
  – Giving investigators an unprecedented level of control over what lives (or dies) in the Commons
• Minimum set of requirements for:
  – Business relationships (coordinating center, investigators)
  – Interfaces (upload, download, manage, compute)
  – Capacity (storage, compute)
  – Networking and connectivity
  – Information assurance
  – Authentication and authorization
• A conformant cloud is not necessarily itself an Infrastructure-as-a-Service (IaaS) vendor, although every conformant cloud must provide IaaS (possibly as a software layer over another vendor’s backend)
• Current conformance guidelines: http://bit.ly/1SWjBeF
• NIH intends to run a 3-year pilot to test the efficacy of this business model in enhancing data sharing and reducing costs
• The pilot will not directly interact with the existing grant system; rather, it is modeled on the mechanisms used to gain access to NSF and DOE national resources (HPC, light sources, etc.)
• The only qualification required to apply for credits will be an existing NIH grant
• A major element will be the collection of metrics to assess the effectiveness of this model
• NIH recently completed a contract with the CAMH Federally Funded Research and Development Center (FFRDC) to act as the coordinating center for this effort
• We need you to:
  – Identify what capabilities will be useful to investigators
  – Provide guidance on the conformance requirements (http://bit.ly/1SWjBeF)
  – Help identify good metrics
  – Define the criteria used to decide which credit requests are selected
(or three depending on how you choose to count them)
[Images: a Demotic inscription (c. 330 BCE) and a Crypto-Minoan inscription (c. 1150 BCE)]
• SEER data:
  – 1 | Male
  – 2 | Female
  – 3 | Other Hermaphrodite
  – 4 | Transsexual
  – 9 | Unknown
• ECOG:
  – 121102 | Other sex
  – 121104 | Ambiguous sex
  – F | Female
  – FC | Female changed to male
  – FP | Female pseudohermaphrodite
  – H | Hermaphrodite
  – M | Male
  – MC | Male changed to female
  – MP | Male pseudohermaphrodite
  – O | Undetermined sex
  – U | Unknown sex
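The two code lists above are the interoperability problem in miniature: the same class of data, two incompatible vocabularies. A minimal sketch (the mapping choices are my own illustration, not an official SEER/ECOG crosswalk) of harmonizing both into one target vocabulary shows why such mappings are lossy:

```python
# Illustrative only: the mapping choices below are assumptions, not an
# official SEER/ECOG crosswalk. Both code lists collapse into one
# target vocabulary, and several distinctions are necessarily lost.
SEER_TO_COMMON = {
    "1": "male",
    "2": "female",
    "3": "other",      # "Other Hermaphrodite": no exact ECOG equivalent
    "4": "other",      # "Transsexual": distinction lost in the merge
    "9": "unknown",
}
ECOG_TO_COMMON = {
    "M": "male", "F": "female", "U": "unknown",
    "O": "unknown",    # "Undetermined" vs "Unknown": a judgment call
    "121102": "other", "121104": "other",
    "FC": "other", "FP": "other", "H": "other",
    "MC": "other", "MP": "other",
}

def harmonize(source: str, code: str) -> str:
    """Map a source-specific sex code to the shared vocabulary."""
    table = {"SEER": SEER_TO_COMMON, "ECOG": ECOG_TO_COMMON}[source]
    return table.get(code, "unmapped")
```

Round-tripping is impossible: once "FC" and "H" both become "other", the original distinction is gone, which is exactly the kind of metadata loss the music-notation example that follows describes.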
“The modern five-line staff was first adopted in France and became almost universal by the 16th century (although the use of staffs with other numbers of lines was still widespread well into the 17th century).” –Wikipedia

“Guido [d’Arezzo, 992–1050 CE] used the first syllable of each line, Ut, Re, Mi, Fa, Sol and La, to read notated music in terms of hexachords; they were not note names, and each could, depending on context, be applied to any note…”

Much early music has been “lost” because there was no standard representation, and the metadata for the non-standard representations were lost (or never recorded).
• Preserves information with sufficient detail to enable reproduction
• Well documented and used by essentially all western musicians
• Highly flexible, capable of representing a very wide range of expected states
• Excellent cross-platform interoperability
• Readily transformed into variant representations without loss (MIDI, ABC, tablature, etc.)
• Advocates of standards need to resist certain temptations:
  – Standardizing too early
  – Standardizing data that is not widely collected
  – Allowing information theories to trump usability
  – Developing standards in the absence of real practitioners
• The traditional answer for non-digital data is no!
• Does this make sense for digital (or, for that matter, non-digital) data?
• Are there circumstances where we can or should ‘throw away’ data?
Possible criteria:
• “Usage”?
• “Importance”?
• Ability to reproduce the experiment?
• Publication associated?
• Time?
• Raw vs. derived?
• NCI (Cloud Pilots)
  – Ishwar Chandramouliswaran
  – Stephen Chanock, MD
  – Tanja Davidsen, Ph.D.
  – Anthony Kerlavage, Ph.D.
  – Warren Kibbe, Ph.D.
  – Juli Klemm, Ph.D.
  – Daoud Meerzaman, Ph.D.
  – Louis Staudt, MD, Ph.D.
  – Barbara Wold, Ph.D.
  – Harold Varmus, MD
• NIH Office of ADDS
  – Vivien Bonazzi, Ph.D.
  – Philip Bourne, Ph.D.
  – Jennie Larkin, Ph.D.
  – Mark Guyer, Ph.D.
• NCBI
  – Dennis Benson, Ph.D.
  – Alan Graeff
  – David Lipman, MD
  – Jim Ostell, Ph.D.
  – Don Preuss

The future of the commons


Editor's Notes

  • #4: 1965 – generation capacity < 100 aa’s/year/person ⇒ Dayhoff creates a one-letter code to simplify computing in the punch-card era. 1977 – Sanger and Maxam-Gilbert sequencing invented; by the mid-1980s, production increases by 2 orders of magnitude (maybe 10–20K bases total, 2–3K finished per year). 1986 – development of dye-based sequencing (ABI 370A); 2,000 bases/day/instrument by the mid-1990s. 1996 – development of DNA microarrays; 2-dye 100K chips ⇒ 200K/chip/day. 2000s – next-gen sequencing; 100M’s/day.
  • #9: This has worked well for a long time, but: every investigator has their own copy of the data; every investigator needs the computational resources to do whatever calculation they want to do; and making locally developed software work outside the local institution is often a challenge (everyone likes the Broad Firehose, but only Broad has made it work!) = duplication and inefficiency. Practical problems: network speeds for transfer and time to run computations. Consider the TCGA data set (2.5 PB): storage and data protection cost approximately $2,000,000 per year per copy, and even with constant network upgrades at universities, 2.5 PB = 20,000,000 Gb = 23 days of transfer at 10 Gb/sec. This is an issue with more ‘normal’-sized data sets as well. Finally: poor ROI, because this stuff rarely winds up leaving the institution and can’t be easily reused.
  • #12: Discuss some practical considerations: co-location of data, software and compute.
  • #19: Minimum requirements: the business relationship exists to allow distribution and billing of credits and to ensure that liability issues are resolved; the investigator who puts a digital object in the Commons retains the liability associated with its use. Interfaces would need to be open, but not necessarily open-source, and must support basic operations; in addition, the environment has to be open to all, so a private environment behind a university firewall won’t work. Identifiers and metadata are tied together and together enable researchers to search for and find resources. Networking and connectivity: make sure resources are accessible; require connection to the commodity internet and Internet2, but the key element from the investigator’s point of view is a free egress tier for academics. The environment must be secure. A&A: must support InCommon, because most NIH investigators have it; this minimizes the hassle of granting access to collaborators across multiple platforms. Approval of clouds: self-certify vs. NIH-certify vs. third-party certify; in early test cases, we may simply say ‘FedRAMPed’. Cloud vs. IaaS: some IaaS vendors (AWS comes to mind) may be uninterested in providing the ‘conformant’ layer themselves but support other companies that provide these services using the AWS backend; there are already exemplars of this: Seven Bridges Genomics and the Cancer Genomics Cloud Pilots are all software layers over an IaaS provider.
  • #23: Two stone tablets, both with information coded in scripts. However, the one on the left can be (and has been) translated; the one on the right, while a bit older, contains data that has been lost, possibly forever. What is the difference?
  • #25: Do these early-music groups have any idea what the music they are trying to play sounded like during the Medieval or Renaissance periods? Short answer: no!
  • #27 Comment: Standards really set you free!
  • #29: Given the essential certainty of hacks, are we obligated to throw some of this data out?
  • #30: “Usage”? How do we measure it: publications? “Actual” usage? If so, how much is ‘enough’? For low-usage data, should the actual users shoulder the financial burden? “Importance”? Again, how do we measure it: publications? Accesses? Expert panels? Ouija board? How important must data be to merit NIH-level support? Ability to reproduce: work on cell lines (easier to reproduce) vs. needle biopsies of human patients? Publication associated: does publication = forever? Time: after some period, can data be thrown away (like tax returns after 7 years)? Raw vs. derived?