Metadata in the BioSample Repository are Impaired by Numerous Anomalies
Rafael Gonçalves
Stanford University
Metadata Are Essential In Science
• Metadata are crucial for finding, reproducing, and reusing the data that they describe
• The FAIR data principles specify desirable criteria that metadata and their datasets should meet to be Findable, Accessible, Interoperable, and Reusable
• For metadata to be interoperable, they should rely on controlled terms from ontologies
The NCBI BioSample Metadata Repository (2011-)
• BioSample stores metadata that describe biological materials (samples) under investigation
• BioSample was designed to standardize descriptions of samples for all NCBI repositories
• Our BioSample dump contains 6,615,347 records
• Collected on June 25th, 2017
A BioSample Metadata Record
Study description & Raw data
Design of Metadata Quality Study
• Do attribute names correspond to ontology terms?
• Are the attribute names used in metadata records specified by BioSample?
• Are the attribute values valid according to their specification? E.g., is the value of a numeric attribute truly a number?
BioSample Metadata Are Categorized Into Packages
• BioSample provides specifications of 104 packages
• E.g., Human, Microbe, Virus, Plant, Pathogen
• A package specifies the set of mandatory and optional attributes that should be used to describe samples
[Diagram: BioSample Metadata Repository contains BioSample Metadata Record; BioSample Metadata Record adheres to BioSample Package]
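A minimal sketch of what adhering to a package could mean in practice, assuming a package is simply a named set of mandatory and optional attribute names. The `Package` class, the `missing_mandatory` and `adheres_to` helpers, and the attribute names in the example are hypothetical illustrations, not BioSample's actual package specifications.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Package:
    """Hypothetical, simplified model of a package specification."""
    name: str
    mandatory: frozenset
    optional: frozenset = field(default_factory=frozenset)


def missing_mandatory(record: dict, package: Package) -> set:
    """Return mandatory attribute names that are absent or blank in the record."""
    return {
        name for name in package.mandatory
        if not str(record.get(name, "")).strip()
    }


def adheres_to(record: dict, package: Package) -> bool:
    """A record adheres to its package when no mandatory attribute is missing."""
    return not missing_mandatory(record, package)


# Illustrative package; the attribute names are assumptions, not the real spec.
example_package = Package(
    name="Example",
    mandatory=frozenset({"organism", "sex", "tissue"}),
    optional=frozenset({"age"}),
)

# The tissue value is blank, so it counts as missing.
record = {"organism": "Homo sapiens", "sex": "female", "tissue": " "}
print(missing_mandatory(record, example_package))
print(adheres_to(record, example_package))
```

Treating blank values as missing is a design choice of this sketch; a stricter or looser rule could be substituted without changing the overall shape of the check.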
The Human Package Specification
Metadata Records Define Multiple Attributes
• Each attribute (name-value pair) represents a characteristic of a sample
• BioSample specifies a dictionary of attributes
• 452 attribute names and their expected value types
• Users can provide attributes with arbitrary names
[Diagram: BioSample Metadata Repository contains BioSample Metadata Record; BioSample Metadata Record adheres to BioSample Package and is composed of BioSample Attributes; each BioSample Attribute defines an Attribute Name and an Attribute Value]
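The question of whether attribute names come from the BioSample dictionary or are user-defined can be sketched as a simple lookup. This is an illustration only: `SPECIFIED_NAMES` is a tiny hypothetical stand-in for the dictionary of 452 attribute names, and the normalization rule (trim, lowercase, whitespace to underscores) is an assumption rather than BioSample's documented behaviour.

```python
from collections import Counter

# Tiny hypothetical stand-in for BioSample's dictionary of 452 attribute names.
SPECIFIED_NAMES = {"age", "sex", "tissue", "disease"}


def normalize(name: str) -> str:
    """Assumed normalization: trim, lowercase, collapse whitespace to underscores."""
    return "_".join(name.strip().lower().split())


def classify_names(records):
    """Count attribute names that are in the dictionary versus user-defined."""
    counts = Counter()
    for record in records:
        for name in record:
            key = "specified" if normalize(name) in SPECIFIED_NAMES else "user-defined"
            counts[key] += 1
    return counts


# Made-up records, for illustration only.
records = [
    {"Age": "34", "sex": "male", "my favourite colour": "blue"},
    {"tissue": "liver", "Sample Notes": "n/a"},
]
print(classify_names(records))
```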
Attribute Types Under Analysis
• Integer: attributes that require integer values
• Boolean: attributes that require Boolean values
• Value set: attributes that take values from value sets defined in the BioSample documentation
• Ontology term: attributes that take term values from specific ontologies
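One way to picture the four attribute types is as a table of validators keyed by type. The sketch below is not the study's implementation: the accepted Boolean spellings, the `SEX_VALUES` value set, and the `KNOWN_TERM_IDS` identifiers are all hypothetical placeholders chosen for illustration.

```python
import re

# Hypothetical value set and ontology term IDs, for illustration only.
SEX_VALUES = {"male", "female", "not collected"}
KNOWN_TERM_IDS = {"EX:0000001", "EX:0000002"}

# Assumed accepted Boolean spellings; the real rules may differ.
BOOLEAN_LITERALS = {"true", "false", "yes", "no", "1", "0"}


def is_integer(value: str) -> bool:
    """True if the value is an optionally signed run of ASCII digits."""
    return re.fullmatch(r"[+-]?[0-9]+", value.strip()) is not None


def is_boolean(value: str) -> bool:
    """True if the value is one of the assumed Boolean spellings."""
    return value.strip().lower() in BOOLEAN_LITERALS


def in_value_set(value: str, allowed: set) -> bool:
    """True if the case-folded value belongs to the given value set."""
    return value.strip().lower() in allowed


def is_ontology_term(value: str, known_ids: set) -> bool:
    """True if the value is one of the known term identifiers."""
    return value.strip() in known_ids


# One validator per attribute type under analysis.
VALIDATORS = {
    "integer": is_integer,
    "boolean": is_boolean,
    "value set": lambda v: in_value_set(v, SEX_VALUES),
    "ontology term": lambda v: is_ontology_term(v, KNOWN_TERM_IDS),
}


def is_valid(attribute_type: str, value: str) -> bool:
    """Dispatch to the validator registered for the attribute type."""
    return VALIDATORS[attribute_type](value)


print(is_valid("integer", "42"), is_valid("integer", "pig"))
print(is_valid("ontology term", "BrCa"))
```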
Most Metadata Submissions Do Not Adhere To Packages
The Vast Majority of Attribute Names Are Defined By Users
No correspondence with ontology terms!
Summary of Results
Simple Fields Have a Wide Range of Values
• Boolean-type attributes have many values that do not parse into Booleans
• For example, for the smoker attribute, there are such diverse values as: Non-smoker, nonsmoker, non smoker, ex-smoker, Ex smoker, smoker, former-smoker, Former, current, …
• While most Integer values are valid, there are many values that are plain text, for example: e;N/A, NO, UVPgt59.4, pig, JM52, stock_180.92, …
• Data types of attribute values are not enforced!
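To make the point concrete, the example values quoted above can be run through a strict parser. This is a sketch under stated assumptions: the sets of accepted Boolean spellings are hypothetical, and integer parsing is delegated to Python's built-in `int()`, which may be more or less permissive than whatever rule BioSample intends.

```python
from typing import Optional

# Assumed accepted Boolean spellings; BioSample's actual rules may differ.
TRUE_LITERALS = {"true", "yes", "y", "1"}
FALSE_LITERALS = {"false", "no", "n", "0"}


def parse_boolean(value: str) -> Optional[bool]:
    """Return True/False for a recognised literal, or None if it does not parse."""
    text = value.strip().lower()
    if text in TRUE_LITERALS:
        return True
    if text in FALSE_LITERALS:
        return False
    return None


def parse_integer(value: str) -> Optional[int]:
    """Return the integer if Python's int() accepts the value, else None."""
    try:
        return int(value.strip())
    except ValueError:
        return None


# Example values quoted on the slide.
smoker_values = ["Non-smoker", "nonsmoker", "non smoker", "ex-smoker",
                 "Ex smoker", "smoker", "former-smoker", "Former", "current"]
integer_values = ["e;N/A", "NO", "UVPgt59.4", "pig", "JM52", "stock_180.92"]

unparsed_booleans = [v for v in smoker_values if parse_boolean(v) is None]
unparsed_integers = [v for v in integer_values if parse_integer(v) is None]
print(len(unparsed_booleans), "of", len(smoker_values), "smoker values do not parse")
print(len(unparsed_integers), "of", len(integer_values), "integer values do not parse")
```

Returning `None` instead of raising keeps the invalid values available for tallying, which suits a survey of anomalies better than failing on the first bad value.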
Ontology Term Attributes are Mostly Populated with Invalid Values
• Example values for the disease metadata attribute: No Adenomas, BrCa, presumed normal, no AD evident at demise, “NL smooth muscle, stomach rmvd as part of pancr., CA”, …
• Example values for the phenotype metadata attribute: unknown, monster, wild_type, none, 30 psu, “The 136 mutant has a shorter root meristem and a reduction in root length of about 75% compared to wild type”, …
BioSample Lacks Standardization of Metadata Attributes
• The values for attributes defined by BioSample are not appropriately verified
• Neither the attribute names nor (most of) their values rely on ontologies
• To be FAIR, the metadata in BioSample would have to improve considerably
A Solution: The CEDAR Workbench
See the CEDAR Resource Track paper & talk
Tuesday - Session 6 - Biomedical and scientific applications
Questions

Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies (SemSci 2017 Workshop)
