Metadata about scientific experiments are crucial for finding, reproducing, and reusing the data that they describe. We present a study of the quality of the metadata stored in BioSample, a repository of metadata about samples used in biomedical experiments that is managed by the U.S. National Center for Biotechnology Information (NCBI). We tested whether 6.6 million BioSample metadata records are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the analyzed metadata. The BioSample field names and their values are not standardized or controlled: 15% of the metadata fields use field names that are not specified in the BioSample data dictionary. Only 9 of the 452 BioSample-specified fields ordinarily require ontology terms as values, and the quality of these controlled fields is better than that of uncontrolled ones; even simple binary or numeric fields are often populated with inadequate values of different data types (e.g., only 27% of Boolean values are valid). Overall, the metadata in BioSample reveal a lack of principled mechanisms to enforce and validate metadata requirements. These aberrancies are likely to impede search and secondary use of the associated datasets.
Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies (SemSci 2017 Workshop)
1. Metadata in the BioSample
Repository are Impaired by
Numerous Anomalies
Rafael Gonçalves
Stanford University
2. Metadata Are Essential In Science
• Metadata are crucial for finding, reproducing,
and reusing the data that they describe
• The FAIR data principles specify desirable
criteria that metadata and their datasets should
meet to be Findable, Accessible, Interoperable,
and Reusable
• For metadata to be interoperable, they should
rely on controlled terms from ontologies
4. The NCBI BioSample Metadata
Repository (2011-)
• BioSample stores metadata that describe
biological materials (samples) under investigation
• BioSample was designed to standardize
descriptions of samples for all NCBI repositories
• Our BioSample dump contains 6,615,347 records
• Collected on June 25th, 2017
7. Design of Metadata Quality Study
• Do attribute names correspond to
ontology terms?
• Are the attribute names used in
metadata records specified by BioSample?
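The second question can be sketched as a set-membership check. This is a minimal illustration, not the study's actual pipeline: the four dictionary names below are only an illustrative subset of the 452 specified names, and the normalization rule and the function names (normalize, unspecified_names) are assumptions made for the example.

```python
# Illustrative subset only; the BioSample dictionary specifies 452 attribute names.
DICTIONARY_NAMES = {"collection_date", "geo_loc_name", "host", "isolation_source"}


def normalize(name: str) -> str:
    """Lowercase the name and turn spaces/hyphens into underscores (assumed rule)."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def unspecified_names(records):
    """Return the attribute names that do not appear in the dictionary."""
    found = set()
    for record in records:
        for name in record:
            if normalize(name) not in DICTIONARY_NAMES:
                found.add(name)
    return found


# Hypothetical records, each a mapping of attribute name to attribute value.
records = [
    {"collection_date": "2016-03-01", "host": "Homo sapiens"},
    {"Geo Loc Name": "USA", "my custom field": "x"},
]
unknown = unspecified_names(records)
print(unknown)
```

How aggressively names are normalized before the lookup changes the result: a stricter rule would also count "Geo Loc Name" as unspecified.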
8. Design of Metadata Quality Study
• Are the attribute values valid
according to their specification?
E.g., is the value of a numeric
attribute truly a number?
9. BioSample Metadata Are
Categorized Into Packages
• BioSample provides specifications of 104 packages
• E.g., Human, Microbe, Virus, Plant, Pathogen, etc.
• A package specifies the set of mandatory and optional
attributes that should be used to describe samples
[Diagram: the BioSample Metadata Repository contains BioSample Metadata Records; each record adheres to a BioSample Package.]
11. Metadata Records Define
Multiple Attributes
• Each attribute (name-value pair) represents a
characteristic of a sample
• BioSample specifies a dictionary of attributes
• 452 attribute names and their expected value types
• Users can provide attributes with arbitrary names
[Diagram: the BioSample Metadata Repository contains BioSample Metadata Records; each record adheres to a BioSample Package and defines BioSample Attributes; each attribute is composed of an Attribute Name and an Attribute Value.]
12. Attribute Types Under Analysis
• Integer - require values that are integers
• Boolean - require values that are Booleans
• Value set - take on values from value sets defined in
the BioSample documentation
• Ontology term - take on term values from specific
ontologies
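One way to picture a check for each of the four attribute types is a small dispatcher, sketched below. Everything in it is an assumption for illustration: the accepted Boolean lexicon, the integer pattern, the type labels, and the idea of representing a value set or an ontology as a plain set of strings. The study's own validation rules are not given on the slides.

```python
import re

INTEGER_RE = re.compile(r"[+-]?[0-9]+")   # assumed definition of a valid integer
BOOLEAN_LEXICON = {"true", "false"}       # assumed accepted Boolean spellings


def validate(value: str, attr_type: str, allowed=frozenset()) -> bool:
    """Check one attribute value against its expected type.

    `allowed` holds the permitted strings for value-set attributes, or the
    lowercased term labels of the target ontology for ontology-term attributes.
    """
    text = value.strip()
    if attr_type == "integer":
        return INTEGER_RE.fullmatch(text) is not None
    if attr_type == "boolean":
        return text.lower() in BOOLEAN_LEXICON
    if attr_type == "value_set":
        return text in allowed
    if attr_type == "ontology_term":
        return text.lower() in allowed
    raise ValueError(f"unknown attribute type: {attr_type}")


# The value set and ontology labels below are hypothetical stand-ins.
print(validate("42", "integer"))                              # True
print(validate("pig", "integer"))                             # False
print(validate("Non-smoker", "boolean"))                      # False
print(validate("male", "value_set", {"male", "female"}))      # True
print(validate("BrCa", "ontology_term", {"breast cancer"}))   # False
```

A real ontology-term check would normally resolve values against an ontology service or a downloaded ontology file instead of an in-memory set of labels.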
16. Simple Fields Have a Wide Range
of Values
• Boolean-type attributes have many values that do not
parse into Booleans
• For example, for the smoker attribute, there are such diverse
values as: Non-smoker, nonsmoker, non smoker, ex-smoker, Ex
smoker, smoker, former-smoker, Former, current, …
• While most Integer values are valid, there are many
values that are plain text, for example: e;N/A, NO,
UVPgt59.4, pig, JM52, stock_180.92, ...
• Data types of attribute values are not enforced!
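A validity rate such as the one reported for Boolean values can be computed by tallying the values that fail to parse. The sketch below assumes that only "true" and "false" (case-insensitive) count as valid; the sample list mixes smoker values quoted above with one hypothetical valid value, and summarize is an illustrative name.

```python
from collections import Counter

ACCEPTED_BOOLEANS = {"true", "false"}  # assumed lexicon of valid Boolean values


def summarize(values):
    """Return (fraction of valid values, Counter of the invalid values)."""
    invalid = Counter(
        v for v in values if v.strip().lower() not in ACCEPTED_BOOLEANS
    )
    valid_count = len(values) - sum(invalid.values())
    rate = valid_count / len(values) if values else 0.0
    return rate, invalid


# Smoker values from the slide, plus one hypothetical valid entry ("true").
smoker_values = [
    "Non-smoker", "nonsmoker", "non smoker", "ex-smoker", "true", "nonsmoker",
]
rate, invalid = summarize(smoker_values)
print(f"valid: {rate:.0%}")
print(invalid.most_common(3))
```

Listing the most frequent invalid values is a cheap way to see which spellings a curation step or a controlled value set would need to cover first.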
17. Ontology Term Attributes are Mostly
Populated with Invalid Values
• Example values for the disease metadata attribute: No
Adenomas, BrCa, presumed normal, no AD evident at
demise, “NL smooth muscle, stomach rmvd as part of
pancr., CA”, …
• Example values for phenotype metadata attribute:
unknown, monster, wild_type, none, 30 psu, “The 136
mutant has a shorter root meristem and a reduction in
root length of about 75% compared to wild type”, …
18. BioSample Lacks Standardization
of Metadata Attributes
• The values for attributes defined by BioSample
are not appropriately verified
• Neither the attribute names nor (most of) their
values rely on ontologies
• To be FAIR, the metadata in BioSample would
have to improve considerably
19. A Solution: The CEDAR Workbench
See the CEDAR Resource Track paper & talk
Tuesday - Session 6 - Biomedical and scientific applications
20. Questions