NIH – Big Data to
Knowledge
• What is BD2K?
• Why is NIH investing $100M in this?
• For information about BD2K – click here




The following slides are highlights and
notes from NIH workshop events

*Information contained here belongs to the author and is not an official viewpoint of the NIH or any other organization
Drivers behind the BD2K grant
• To meet the emerging needs of the biomedical research community
• To create a better research ecosystem
• NIH seeks to invest in ways to help researchers easily find, access, analyze, and curate research data

The Purpose of NIH’s Data
Catalog Workshops
• To take steps, independently and in partnership with others, to enable a future state in which clinical data (including electronic health record data) are used effectively to conduct research and improve population health
• Workshop participants engage actively in the discussions, helping NIH develop plans, programs, and funding initiatives to implement BD2K

Challenges
• Data sharing among biomedical researchers is lacking
• There is no technical infrastructure for NIH-funded researchers to easily submit datasets associated with their work
• Those datasets are not available to other researchers
• There is little motivation to share data, since the most common current unit of academic credit is co-authorship in the peer-reviewed literature

NIH’s Goals for BD2K
• To advance basic and translational science by facilitating and enhancing the sharing of research-generated data
• To promote the development of new analytical methods and software for this emerging data
• To increase the workforce in quantitative science, maximizing the return on the NIH’s public investment in biomedical research

NIH’s Goals for BD2K
• To improve the public’s ability to discover and access data resulting from federally funded research
• Researchers want visual analytics, and to build the database into a "social network" – being able to "friend" or "like" the data
The Model
• When the NIH created ClinicalTrials.gov in collaboration with the Food and Drug Administration (FDA) and medical journals, the resource enabled clinical research investigators to track ongoing or completed trials. Subsequent requirements to enter outcome data have added to its value.
• Establishing an analogous repository of molecular, phenotype, imaging, and other biomedical research data would be of great value to the biomedical research community.

NIH is looking for solutions


• The development and implementation of analytical methods and software tools valuable to the research community follow a four-stage process:
• Prototyping within the context of targeted scientific research projects
• Engineering within robust software tools that provide appropriate user interfaces and data input/output features for effective community adoption and utilization
• Dissemination to the research community, a process that may require the availability of appropriate data storage and computational resources
• Maintenance and support to address users’ questions, community-driven requests for bug fixes, usability improvements, and new features
The Opportunity
• The training of future data scientists is at stake
• The creation of a platform for scientific communities to share data with citizen groups
• A new science – new discoveries and relationships across data
NIH Data Catalog: Future
Vision
• Interoperation with other systems, interdisciplinary collaborations
• "Likes" and citation metrics helping to find relevant datasets
• Non-obvious relationship discovery
• Journals embed links within publications
• Enable learning: educational uses of data
• Return data to the community: patients too can access data
Search is Broken vs. Big
Data
• Documents are not just containers for keywords.
• Objects & meanings relate to people, documents, snippets, tweets, journals, doctors, caregivers, patients.
• Search is about the keywords and ignores everything else.

www.ibm.com
Academic Publishing vs.
Open Access
• August 2013 – Univ. of California approved open access standards for research on all campuses.
• 2012 – Harvard Library urged its 2,100 faculty to boycott for-profit academic research databases and instead submit articles to lower-cost open access journals.
• Also, the White House pledged $100 million to promote open access and to require all federally funded research to be available free of charge.

Clinical Studies and
Collaboration with
Pharmaceutical Companies
• The real-world population is rarely reflected in the selected population of a single clinical trial data set. Combining and mining multiple data sets can produce a more holistic view, which is the standard that both patients and regulators expect therapies to be measured against.
• Pharma companies need to embrace the challenge of using combined data sets to uncover insights they did not previously have.
• This has the potential to benefit both the competing companies producing drugs and patients, who will have improved outcomes.
Solutions Profile
• There should be a system put in place by NIH/NLM for widespread sharing of data.
• Feedback: "we have the information, but we do not know how to use it."
• A data system should be created to integrate data types, capture data, and create "space" for raw data.
BD2K Overview
• Investing in technology and tools needed to enable researchers to easily find, access, analyze, and curate research data.
• Increasing the capacity of the workforce (both experts and non-experts) and employing strategic planning to leverage IT advances for the entire NIH community.
• Serving the millions of Americans (citizen scientists) who may want to research their own disease history.

The Citizen Scientist
• 1 million users/patients download their health data, but much of it is unreadable.
• Mashups occur to build apps to read health records.
• The biomedical research community is within a few years of the "thousand-dollar human genome needing a million-dollar interpretation."


Editor's Notes

  • #2 Why is the NIH investing $100M at the intersection of data science and health research?