The NIH Data Landscape
• Total data currently in the intramural and
extramural communities is estimated at 650 PB
• 20 PB of that is in NCBI/NLM (about 3%) and it is
expected to grow by 10 PB this year, with
genotyping the most significant contributor
• 12% of data described in published papers is in
recognized archives – 88% is dark data
• In the 8 years 2007-2014 we spent $1.2Bn on
those archives
• Grappling to understand the research data
lifecycle
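The figures on this slide can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative and not part of any NIH tooling.

```python
# Figures quoted on the slide (PB = petabytes, USD = US dollars).
total_pb = 650             # estimated intramural + extramural data
ncbi_pb = 20               # portion held at NCBI/NLM
ncbi_growth_pb = 10        # expected NCBI/NLM growth this year
archive_spend_usd = 1.2e9  # spend on recognized archives, 2007-2014
archive_years = 8

ncbi_share = ncbi_pb / total_pb                      # fraction of all data at NCBI/NLM
ncbi_growth_rate = ncbi_growth_pb / ncbi_pb          # relative growth in one year
spend_per_year = archive_spend_usd / archive_years   # average annual archive spend

print(f"NCBI/NLM share of total: {ncbi_share:.1%}")
print(f"NCBI/NLM one-year growth: {ncbi_growth_rate:.0%}")
print(f"Average archive spend per year: ${spend_per_year / 1e6:.0f}M")
```

On these numbers, NCBI/NLM holds roughly 3% of the total, would grow by half in a single year, and the archives cost about $150M per year on average.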
Example Data Types
• From Molecules to Populations
– Longitudinal patient data e.g. Framingham
– The Cancer Genome Atlas (TCGA) – genomic
changes in more than 200 types and many more
subtypes of cancer
– The human microbiome – microbes are more
numerous than human cells, of which there are
37.2 trillion
BD2K
• $110M per year research project
– 20% on training
– Standards coordination
– Sustainability
– New developments applied to biomedical
problems
• Compression
• Visualization
• Data wrangling
• Privacy
The Commons
• A virtual collaborative space physically
instantiated on public clouds which agree to be
Commons compliant
• Content – data, software, narrative, etc. – are
research objects within that space with unique
identifiers and varying degrees of metadata
• Commons compliant means
– They conform to the FAIR principles
• E.g. Find – an early-stage indexing engine exists – bioCADDIE
• Like Vegas – what happens in the Commons stays
in the Commons and is shared
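One way to picture a research object with a unique identifier and metadata, plus a toy version of "Find", is sketched below. Everything here is an assumption for illustration: the `ResearchObject` class, its fields, the `find` helper, and the example titles and metadata values are hypothetical and do not describe the Commons or bioCADDIE implementations.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class ResearchObject:
    """Hypothetical research object: data, software, or narrative."""
    kind: str                                     # "data", "software", "narrative"
    title: str
    metadata: dict = field(default_factory=dict)  # may be sparse or rich
    identifier: str = field(
        default_factory=lambda: f"urn:uuid:{uuid.uuid4()}"  # unique per object
    )


def find(objects, term):
    """Toy 'Find': case-insensitive match on title or any metadata value."""
    term = term.lower()
    return [
        obj for obj in objects
        if term in obj.title.lower()
        or any(term in str(value).lower() for value in obj.metadata.values())
    ]


# Example objects with made-up metadata, for illustration only.
commons = [
    ResearchObject("data", "Framingham longitudinal cohort", {"domain": "cardiology"}),
    ResearchObject("data", "TCGA genomic changes", {"domain": "cancer"}),
    ResearchObject("software", "Variant caller", {"applies_to": "cancer genomes"}),
]

hits = find(commons, "cancer")
print([obj.title for obj in hits])
```

The identifier is generated once per object, so two objects with the same title remain distinguishable, and the search degrades gracefully when metadata is thin.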
Implications
• Potentially more cost effective
• Potentially increases productivity
• Credits model
– Driving competition into the market place
– Only pay for what is used
• Know what is being used when
• Can make informed decisions re sustaining
data
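The credits model above (pay only for what is used, and know what is used when) can be sketched as a small ledger. This is a minimal illustration under stated assumptions: the `CreditAccount` class, the rates, and the resource names are hypothetical, not an NIH or cloud-provider API.

```python
from collections import Counter


class CreditAccount:
    """Hypothetical credit balance for one investigator."""

    def __init__(self, credits: float):
        self.credits = credits
        self.usage = Counter()  # units consumed per resource

    def charge(self, resource_id: str, units: float, rate: float) -> float:
        """Deduct credits only for units actually used; record the usage."""
        cost = units * rate
        if cost > self.credits:
            raise ValueError("insufficient credits")
        self.credits -= cost
        self.usage[resource_id] += units
        return cost


account = CreditAccount(credits=1000.0)
account.charge("dataset-A", units=100, rate=2.0)
account.charge("dataset-B", units=50, rate=2.0)
account.charge("dataset-A", units=25, rate=2.0)

# The usage record shows which resources are actually in demand,
# which is the kind of evidence a sustainability decision could draw on.
most_used = account.usage.most_common(1)[0]
print(account.credits, most_used)
```

Because every charge is logged against a resource, the same record that bills the user also answers which data are being used.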

Highlights from NIH Data Science
