Contributing to the Big Data to Knowledge Initiative at the NIH

Data Sharing and the NIH Data Catalog

Big Data to Knowledge (BD2K)
Big Data 2 Knowledge

• Data Catalog
• Frameworks and Standards
• Policies and Data Sharing
Data Sharing Repositories

 All NIH-funded data sharing repositories
 that are open to receiving data
 submissions from any researcher
 internationally - whether they are funded
 by the NIH or not
Data Sharing Policies

All data sharing policies that exist within
the NIH that assist researchers in
developing a plan to share their research
data
Big Data 2 Knowledge

• Data Catalog
• Frameworks and Standards
• Policies and Data Sharing
NIH Data Catalog




Bringing Data Into the Research Ecosystem
Datasets are discoverable
Each dataset will be identified via Data Unique
Identifier [DUID] (in NIH Data Catalog and in the
associated journal)

Datasets specified in catalog using MeSH
(creation of a dataset Publication Type)
Datasets are citable
NIH Data Catalog produces citable data
publications

Citability + proper credit = incentives to
submit and publish data
Datasets are linked to the literature
Data citations linked between and across
the NIH Data Catalog with their associated
scientific publication in PubMed/PubMed
Central
Datasets become information in the research ecosystem
Analysis of trends, impact of data, effect on
NIH research funding
Common Metadata Elements



How do current data repositories describe their data?
NIH Data Sharing Repositories
Identifying Metadata Commonalities

[Figure labels: Date Information; Data Description; Authorship; Common Metadata Elements]
Building a Taxonomy of Metadata Descriptors

• Title information
   o   Name Title
   o   Collection Type
   o   Type of Deposit
   o   Service Name
   o   Image File Name
   o   File Name
   o   Data Collection Title
   o   Dataset Title
   o   Dataset Name and Accession
   o   Submission Title
   o   Lab Data Title
   o   Research Objective

• Authorship
   o   Attribution
   o   Authors
   o   Creator(s)
   o   Data Authors
   o   Data Owner
   o   Data Attribution
   o   Contributor(s)
   o   PI Name(s)
   o   Investigator(s)
   o   Sequence Authors
   o   Responsible Party
   o   Data Provider
   o   Submitter
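
The taxonomy above can be read as a lookup from repository-specific field labels to a common element. A minimal Python sketch follows; the labels are copied from the slide, while the names `TAXONOMY`, `common_element`, and the case/whitespace normalisation rule are assumptions made for illustration, not part of the original deck.

```python
from typing import Optional

# Labels copied from the taxonomy slide; the structure itself is hypothetical.
TAXONOMY = {
    "Title information": [
        "Name Title", "Collection Type", "Type of Deposit", "Service Name",
        "Image File Name", "File Name", "Data Collection Title",
        "Dataset Title", "Dataset Name and Accession", "Submission Title",
        "Lab Data Title", "Research Objective",
    ],
    "Authorship": [
        "Attribution", "Authors", "Creator(s)", "Data Authors", "Data Owner",
        "Data Attribution", "Contributor(s)", "PI Name(s)", "Investigator(s)",
        "Sequence Authors", "Responsible Party", "Data Provider", "Submitter",
    ],
}


def _normalise(label: str) -> str:
    """Lower-case and collapse whitespace (an assumed matching rule)."""
    return " ".join(label.lower().split())


# Invert the taxonomy: normalised repository label -> common element.
LABEL_TO_ELEMENT = {
    _normalise(label): element
    for element, labels in TAXONOMY.items()
    for label in labels
}


def common_element(label: str) -> Optional[str]:
    """Return the common element for a repository label, or None if unknown."""
    return LABEL_TO_ELEMENT.get(_normalise(label))
```

For example, `common_element("PI Name(s)")` resolves to "Authorship", and a label that is not on the slide resolves to `None` so it can be reviewed by hand.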
Mapping Metadata to Existing Standards

[Figure label: Common Metadata Elements]
Mapping to DataCite

•   Common Metadata Elements
    o   Data Unique Identifier
    o   Authorship
    o   Data Title
    o   Data Location
    o   Data Completion/Release Date
    o   Data Descriptors (controlled vocabulary)
    o   Data Submitter/Affiliation
    o   Date Information
    o   Data File Types
    o   Related Resources
    o   Access Data Restrictions
    o   Data Description (narrative)

•   DataCite Metadata Schema
    o   Identifier
    o   Creator
    o   Title
    o   Publisher
    o   PublicationYear
    o   Subject
    o   Contributor
    o   Date
    o   Resource Type
    o   RelatedIdentifier
    o   Rights
    o   Description
    o   Size, Format, Version
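
One way to make such a mapping concrete is a crosswalk table. The Python sketch below is illustrative only: the slide lists the two vocabularies side by side without stating which element maps to which DataCite property, so every pairing in `CROSSWALK` is an assumption, and `to_datacite` is a hypothetical helper name. The example values are taken from the sample citation elsewhere in this deck.

```python
from typing import Dict, List, Tuple

# Assumed pairings: the slide does not state an element-to-property mapping.
CROSSWALK: Dict[str, str] = {
    "Data Unique Identifier": "Identifier",
    "Authorship": "Creator",
    "Data Title": "Title",
    "Data Location": "Publisher",
    "Data Completion/Release Date": "PublicationYear",
    "Data Descriptors (controlled vocabulary)": "Subject",
    "Data Submitter/Affiliation": "Contributor",
    "Date Information": "Date",
    "Data File Types": "Format",
    "Related Resources": "RelatedIdentifier",
    "Access Data Restrictions": "Rights",
    "Data Description (narrative)": "Description",
}


def to_datacite(record: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Rename common-element keys to DataCite property names.

    Returns the renamed record plus the keys that had no crosswalk entry,
    so unmapped elements are reported rather than silently dropped.
    """
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for element, value in record.items():
        target = CROSSWALK.get(element)
        if target is None:
            unmapped.append(element)
        else:
            mapped[target] = value
    return mapped, unmapped


example_record = {
    "Data Unique Identifier": "DUID00001",
    "Data Title": "Dental Caries: Whole Genome Association and Gene x Environment Studies",
    "Award Number": "(not given)",  # no DataCite pairing assumed here
}
mapped, unmapped = to_datacite(example_record)
```

Returning the unmapped keys separately keeps gaps in the crosswalk visible, which is the point of comparing the two vocabularies.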
Mapping to Dryad


•   Common Metadata Elements
    o   Data Unique Identifier
    o   Authorship
    o   Data Title
    o   Data Location
    o   Data Completion/Release Date
    o   Data Descriptors (controlled vocabulary)
    o   Data Submitter/Affiliation
    o   Date Information
    o   Data File Types
    o   Related Resources
    o   Access Data Restrictions
    o   Data Description (narrative)

•   Dryad Metadata Schema
    o   dcterms:identifier/Data Package Identifier
    o   dcterms:creator/Author
    o   dcterms:title/Data Package Title
    o   dcterms:relation/Location of related content outside of Dryad
    o   dcterms:available/Date Available
    o   dcterms:description
    o   dcterms:subject/Keyword
    o   dwc:scientificName
    o   dcterms:references/Associated Dryad publication record ID
Mapping to MEDLINE

Common Metadata Elements                  Proposed Definition

Data Unique Identifier                    A unique ID string that identifies a dataset within the catalog
Author                                    Individuals involved in producing or contributing to data
Affiliation                               Affiliation of each author associated with the appropriate author
                                          occurrence
Data Title                                Name or title by which the dataset is known
Data Location                             The name of the entity that holds, archives, publishes,
                                          distributes, releases, issues, or produces the data, with its
                                          associated accession number
Date                                      The year, month and date when the data was made available
Data Description (structured narrative)   Structured narrative description for efficient indexing
Data Descriptors                          Metadata describing data contents using controlled labels (e.g.
                                          Organism, Disease, Perturbation, Gender, Cell type)
PMIDs                                     Identifier that will link dataset to associated article(s)
Availability/Accessibility of Data        Whether the data is available to use and how to access it
Award Number                              Grant/award numbers associated with the dataset
Version                                   The version of the dataset (represented as a unique record)
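Data Citation - ICMJE

Marazita ML, Weynat RJ, Feingold E, Weeks D,
Crout R, McNeill D. Dental Caries: Whole Genome
Association and Gene x Environment Studies. NIH
Data Catalog. 2014 Jan;1(1):DUID00001. PubMed
PMID: 22123456.
SI: dbGaP/pht002543.v2.p1

Callouts on the citation:
• Author: Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D
• Data Title: Dental Caries: Whole Genome Association and Gene x Environment Studies
• Data Location
• Date data is submitted and paper is ready to publish: 2014 Jan
• NIH Data Catalog Volume (Issue): 1(1)
• Data Unique Identifier: DUID00001
• PMID assigned to NIH Data Catalog record: 22123456
• Secondary source ID (link to actual dataset): SI: dbGaP/pht002543.v2.p1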
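
The citation layout above can be assembled from the individual fields. The following Python sketch is a minimal illustration, not the catalog's actual implementation: the `DataCitation` class and its field names are hypothetical, and the punctuation pattern is inferred from the single example on the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCitation:
    """Fields of one data citation (hypothetical field names)."""
    authors: List[str]
    title: str
    date: str      # e.g. "2014 Jan", as shown on the slide
    volume: int
    issue: int
    duid: str      # Data Unique Identifier
    pmid: str
    source: str = "NIH Data Catalog"

    def format(self) -> str:
        """Render the citation in the layout shown on the slide."""
        return (
            f"{', '.join(self.authors)}. {self.title}. {self.source}. "
            f"{self.date};{self.volume}({self.issue}):{self.duid}. "
            f"PubMed PMID: {self.pmid}."
        )


# Values copied from the slide's example citation.
example = DataCitation(
    authors=["Marazita ML", "Weynat RJ", "Feingold E", "Weeks D",
             "Crout R", "McNeill D"],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    date="2014 Jan",
    volume=1,
    issue=1,
    duid="DUID00001",
    pmid="22123456",
)
citation_text = example.format()
```

Keeping the fields separate from the rendered string means the same record could also feed a catalog entry or a metadata export.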
NIH Data Catalog Issues and Concerns
What are we missing?
How many NIH datasets actually exist?
How many unique NIH datasets are NOT represented in existing data repositories?
Could these datasets be represented as a data publication instead of in a repository?
If the datasets are already housed somewhere, do we need a one-stop shop?
Is an NIH Data Catalog the best solution?
Next Steps
• Find out how many datasets are currently in NIH
  data sharing repositories
   o How many datasets do these repositories process per year?


• How many datasets are unique and NOT housed in a
  repository?
   o Search PubMed and PubMed Central and assign categories
      • MeSH
            o PT: Electronic Supplementary material
            o SH: Statistical and numerical data
            o MeSH: Databases, Factual
   o Statistical Analysis – exclude datasets that already have a location


• How do we manage these unique datasets?
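
The search-and-exclude step above can be sketched in Python. This is a hedged illustration under several assumptions: the category names are copied verbatim from the slide and their exact spelling should be checked against the MeSH vocabulary before use; `[pt]`, `[sh]`, and `[mh]` are PubMed's field tags for publication type, MeSH subheading, and MeSH term; combining the categories with OR is a guess, since the slide does not say how they are combined; and `build_query`, `unhoused`, and the sample identifiers are hypothetical. No search is actually run here.

```python
from typing import Iterable, List, Set


def build_query(publication_types: Iterable[str],
                subheadings: Iterable[str],
                mesh_terms: Iterable[str]) -> str:
    """Build a PubMed-style query string from the slide's categories.

    Categories are joined with OR (an assumption; the slide does not say).
    """
    clauses: List[str] = (
        [f'"{t}"[pt]' for t in publication_types]
        + [f'"{t}"[sh]' for t in subheadings]
        + [f'"{t}"[mh]' for t in mesh_terms]
    )
    return " OR ".join(clauses)


def unhoused(candidates: Set[str], already_located: Set[str]) -> Set[str]:
    """Exclude records whose datasets already have a repository location."""
    return candidates - already_located


# Category names copied verbatim from the slide.
query = build_query(
    publication_types=["Electronic Supplementary material"],
    subheadings=["Statistical and numerical data"],
    mesh_terms=["Databases, Factual"],
)

# Made-up identifiers, purely to show the exclusion step.
remaining = unhoused({"A1", "A2", "A3"}, {"A2"})
```

The set difference mirrors the "exclude datasets that already have a location" step: whatever remains is the pool of unique datasets the last question asks how to manage.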
Questions?
  Thank you.

Editor's Notes

  1. Big Data to Knowledge (BD2K) is an initiative to address how best to manage and use the large amounts of biomedical data that new technologies can generate (http://www.nih.gov/news/health/dec2012/od-07.htm). The initiative resulted from a set of recommendations from the Data and Informatics Working Group to the Advisory Committee to the Director, NIH. As part of its response to the recommendations, NIH has established a working group to develop plans to implement new programs to increase training in this area, and this working group intends to convene a workshop to discuss training and education needs in how to manage and use large, complex data sets. Prior to the workshop, NIH wishes to collect information and relevant materials that will help inform the discussions of the workshop participants.