Metadata in the BioSample Repository are Impaired by Numerous Anomalies

In this talk I present a study of the quality of metadata in the NCBI BioSample repository. I found that metadata in BioSample are generally of poor quality. In particular, BioSample metadata make little to no use of ontology terms, which are needed to make (meta)data interoperable.

Published in: Science

  1. Metadata in the BioSample Repository are Impaired by Numerous Anomalies. Rafael Gonçalves, Stanford University
  2. Metadata Are Essential In Science • Metadata are crucial for finding, reproducing, and reusing the data that they describe • The FAIR data principles specify desirable criteria that metadata and their datasets should meet to be Findable, Accessible, Interoperable, and Reusable • For metadata to be interoperable, they should rely on controlled terms from ontologies
  3. The NCBI BioSample Metadata Repository (2011-)
  4. The NCBI BioSample Metadata Repository (2011-) • BioSample stores metadata that describe biological materials (samples) under investigation • BioSample was designed to standardize descriptions of samples for all NCBI repositories • Our BioSample dump contains 6,615,347 records, collected on June 25th, 2017
  5. A BioSample Metadata Record (figure labels: Study description & Raw data)
  6. Design of Metadata Quality Study
  7. Design of Metadata Quality Study • Do attribute names correspond to ontology terms? • Are the attribute names used in metadata records specified by BioSample?
  8. Design of Metadata Quality Study • Are the attribute values valid according to their specification? E.g., is the value of a numeric attribute truly a number?
  9. BioSample Metadata Are Categorized Into Packages • BioSample provides specifications of 104 packages • E.g., Human, Microbe, Virus, Plant, Pathogen, etc. • A package specifies the set of mandatory and optional attributes that should be used to describe samples (diagram: the BioSample Metadata Repository contains BioSample Metadata Records; each record adheres to a BioSample Package)
  10. The Human Package Specification
  11. Metadata Records Define Multiple Attributes • Each attribute (name-value pair) represents a characteristic of a sample • BioSample specifies a dictionary of attributes: 452 attribute names and their expected value types • Users can provide attributes with arbitrary names (diagram: the BioSample Metadata Repository contains BioSample Metadata Records; each record adheres to a BioSample Package and defines BioSample Attributes; each attribute is composed of an Attribute Name and an Attribute Value)
  12. Attribute Types Under Analysis • Integer: requires values that are integers • Boolean: requires values that are Booleans • Value set: takes values from value sets defined in the BioSample documentation • Ontology term: takes term values from specific ontologies
  13. Most Metadata Submissions Do Not Adhere To Packages
  14. The Vast Majority of Attribute Names Are Defined By Users • No correspondence with ontology terms!
  15. Summary of Results
  16. Simple Fields Have a Wide Range of Values • Boolean-type attributes have many values that do not parse into Booleans • For example, for the smoker attribute, there are such diverse values as: Non-smoker, nonsmoker, non smoker, ex-smoker, Ex smoker, smoker, former-smoker, Former, current, … • While most Integer values are valid, there are many values that are plain text, for example: e;N/A, NO, UVPgt59.4, pig, JM52, stock_180.92, ... • Data types of attribute values are not enforced!
  17. Ontology Term Attributes are Mostly Populated with Invalid Values • Example values for the disease metadata attribute: No Adenomas, BrCa, presumed normal, no AD evident at demise, “NL smooth muscle, stomach rmvd as part of pancr., CA”, … • Example values for the phenotype metadata attribute: unknown, monster, wild_type, none, 30 psu, “The 136 mutant has a shorter root meristem and a reduction in root length of about 75% compared to wild type”, …
  18. BioSample Lacks Standardization of Metadata Attributes • The values for attributes defined by BioSample are not appropriately verified • Neither the attribute names nor (most of) their values rely on ontologies • To be FAIR, the metadata in BioSample would have to improve considerably
  19. A Solution: The CEDAR Workbench • See the CEDAR Resource Track paper & talk: Tuesday, Session 6, Biomedical and scientific applications
  20. Questions
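
The value checks described on slides 8, 12, and 16 can be illustrated with a minimal Python sketch. It is not the study's actual code: the function names, the accepted Boolean spellings, and the integer pattern are all assumptions made for illustration, and BioSample's real validation rules may differ. Ontology-term attributes are left out because checking them would require an ontology lookup service.

```python
import re

# Assumed spellings accepted as Booleans (illustrative only; not taken from
# the BioSample documentation).
TRUE_WORDS = {"true", "yes", "y", "1"}
FALSE_WORDS = {"false", "no", "n", "0"}


def is_integer(value: str) -> bool:
    """Return True if the value is an optionally signed whole number."""
    return re.fullmatch(r"[+-]?\d+", value.strip()) is not None


def is_boolean(value: str) -> bool:
    """Return True if the value is one of the assumed Boolean spellings."""
    return value.strip().lower() in TRUE_WORDS | FALSE_WORDS


def in_value_set(value: str, allowed: set[str]) -> bool:
    """Return True if the value matches an allowed term, ignoring case."""
    return value.strip().lower() in {term.lower() for term in allowed}


# Sample values for the smoker attribute, taken from slide 16, plus "yes"
# as a hypothetical well-formed value for contrast.
smoker_values = ["Non-smoker", "nonsmoker", "ex-smoker", "Former", "current", "yes"]
invalid_smoker = [v for v in smoker_values if not is_boolean(v)]

# Sample values found in Integer-type attributes, taken from slide 16.
integer_values = ["NO", "UVPgt59.4", "pig", "JM52", "stock_180.92"]
invalid_integers = [v for v in integer_values if not is_integer(v)]
```

Under these assumptions, every smoker value from the slide fails the Boolean check, and every listed Integer value fails the integer check, which is the kind of type enforcement the talk reports as missing.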

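The structure on slides 9 and 11 (records adhere to a package, packages list mandatory and optional attributes, and BioSample publishes a dictionary of attribute names) suggests a simple adherence check, sketched below. The class names, the `check_record` helper, and the sample package contents are hypothetical; they do not reproduce the real Human package or the real 452-name dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """A package: mandatory and optional attribute names (hypothetical model)."""
    name: str
    mandatory: set[str]
    optional: set[str] = field(default_factory=set)


@dataclass
class Record:
    """A metadata record: a package name plus attribute name-value pairs."""
    package: str
    attributes: dict[str, str]


def check_record(record: Record, package: Package, dictionary: set[str]) -> dict[str, list[str]]:
    """Report missing mandatory attributes and names absent from the dictionary."""
    names = set(record.attributes)
    return {
        "missing_mandatory": sorted(package.mandatory - names),
        "user_defined": sorted(names - dictionary),
    }


# Illustrative data only; not the actual BioSample specification.
dictionary = {"organism", "sex", "age", "tissue", "smoker"}
human = Package(name="Human", mandatory={"organism", "sex"}, optional={"age", "smoker"})
record = Record(
    package="Human",
    attributes={"organism": "Homo sapiens", "smoker": "ex-smoker", "sample label": "A1"},
)

report = check_record(record, human, dictionary)
```

Here `report` flags `sex` as a missing mandatory attribute and `sample label` as a user-defined name, mirroring the two questions the study asks about package adherence and attribute names.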