In biomedicine, high-quality metadata are crucial for finding experimental datasets, for understanding how experiments were performed, and for reproducing those experiments. Despite the recent focus on metadata, the quality of metadata available in public repositories continues to be extremely poor. A key difficulty is that the typical metadata acquisition process is time-consuming and error-prone, with weak or nonexistent support for linking metadata to ontologies. There is a pressing need for methods and tools to speed up the metadata acquisition process and to increase the quality of the metadata that are entered. In this talk, we describe a methodology and a set of associated tools that we developed to address this challenge. A core component of this approach is a value recommendation framework that uses analysis of previously entered metadata and ontology-based metadata specifications to help users enter their metadata rapidly and accurately. We performed an initial evaluation of this approach using metadata from a public metadata repository.
Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations (AMIA 2017 Conference)
1. S100
Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L., Gevaert, O., Graybeal, J., Musen, M. A.
Stanford University
Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations
2. What is metadata?
AMIA 2017 | amia.org
• Data that describe data
• Crucial for:
• Finding experimental datasets online
• Understanding how the experiments were performed
• Reusing the data to perform new analyses
4.
age
Age
AGE
`Age
age (after birth)
age (in years)
age (y)
age (year)
age (years)
Age (years)
Age (Years)
age (yr)
age (yr-old)
age (yrs)
Age (yrs)
age [y]
age [year]
age [years]
age in years
age of patient
Age of patient
age of subjects
age(years)
Age(years)
Age(yrs.)
Age, year
age, years
age, yrs
age.year
age_years
Poor metadata
5.
An analysis of metadata from NCBI’s BioSample
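• 73% of “Boolean” values are not valid Booleans
• e.g., nonsmoker, former-smoker
• 26% of “integer” values are not valid integers
• e.g., JM52, UVPgt59.4, pig
• 68% of values meant to be ontology terms are not valid ontology terms
• e.g., presumed normal, wild_type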
Gonçalves, R. S. et al. (2017). Metadata in the BioSample Online Repository are Impaired by Numerous
Anomalies. SemSci 2017 Workshop, co-located with ISWC 2017. Vienna, Austria.
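The kind of check behind these percentages can be sketched as follows. This is only an illustration, not the method used in the cited analysis: the function names are hypothetical, and the assumption that a valid Boolean is exactly "true" or "false" (case-insensitive) is ours.

```python
def is_boolean(value: str) -> bool:
    """Assumption: only 'true' and 'false' (any case) count as valid Booleans."""
    return value.strip().lower() in {"true", "false"}


def is_integer(value: str) -> bool:
    """A value is a valid integer if Python's int() can parse it."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def invalid_fraction(values, validator) -> float:
    """Fraction of values rejected by the given validator (0.0 for no values)."""
    values = list(values)
    if not values:
        return 0.0
    invalid = sum(1 for v in values if not validator(v))
    return invalid / len(values)


# Example values taken from the slide, mixed with a few well-formed ones.
boolean_values = ["nonsmoker", "former-smoker", "true", "false"]
integer_values = ["JM52", "UVPgt59.4", "pig", "42"]

print(invalid_fraction(boolean_values, is_boolean))
print(invalid_fraction(integer_values, is_integer))
```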
Poor metadata
6. [Your presentation on this and next slides]
Metadata authoring is hard
7.
• A computational platform for metadata management
• Goal: Overcome the impediments to creating high-quality metadata
Metadata template
8.
DESIGN TEMPLATE → FILL IN METADATA → SUBMIT METADATA
Template Designer: template authors (e.g., standards committees)
Metadata Editor: metadata authors (e.g., scientists)
Metadata Repository (template, metadata)
LINCS, Public Databases
https://cedar.metadatacenter.org/templates/edit/https://repo.metadatacenter.org/templates/ab105771-564e-42a1-9be4-5a63891…
https://cedar.metadatacenter.org/instances/edit/https://repo.metadatacenter.org/template-instances/d4f1059e-8e27-4166-902f-…
A sample study
Acute stress disorder
Stanford University
John Doe
Longitudinal
9. We developed a metadata recommendation system
10. Metadata recommendation system
[Diagram: Metadata Editor, Metadata Repository, and Metadata Recommender, with three numbered steps]
• Store metadata
• Analyze existing metadata
• Generate suggestions
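The "analyze existing metadata" and "generate suggestions" steps can be illustrated with a minimal sketch. It is not the CEDAR implementation: the function name, the dictionary representation of a metadata instance, and the exact-match scoring are all assumptions made for illustration. The idea shown is that candidate values for a field are ranked by how often they occur in stored instances that agree with the values the user has already entered.

```python
from collections import Counter


def suggest_values(instances, target_field, context, top_n=3):
    """Rank candidate values for target_field (hypothetical helper).

    instances: stored metadata instances, each a dict of field -> value.
    context:   values already entered by the user, as a dict of field -> value.
    Returns up to top_n (value, relative frequency) pairs, most frequent first.
    """
    counts = Counter()
    for instance in instances:
        value = instance.get(target_field)
        if value is None:
            continue
        # Keep only instances that agree with every already-entered value.
        if all(instance.get(field) == entered for field, entered in context.items()):
            counts[value] += 1
    total = sum(counts.values())
    return [(value, count / total) for value, count in counts.most_common(top_n)]


# Toy repository content, invented for this example.
stored = [
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "lung", "disease": "lung cancer"},
    {"tissue": "blood", "disease": "lymphoma"},
]

for value, score in suggest_values(stored, "disease", {"tissue": "lung"}):
    print(f"{value}: {score:.2f}")
```

With an empty context, the same function falls back to ranking values by overall frequency.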
16. Evaluation workflow
[Figure: evaluation workflow]
• Inputs: Gene Expression metadata and the CEDAR BioSample template
• (1) Preprocessing and Ingestion: BioSample template instances (≈35K), stored in the CEDAR Metadata Repository
• (2) Semantic annotation: annotated BioSample template instances (≈35K)
• (3) Training: Metadata Recommender trained on the training dataset (80%)
• (4) Testing & Analysis: evaluation results obtained on the test dataset (20%)
• Both the plain and the annotated instances are split into a training dataset (80%) and a test dataset (20%)
• For “disease”, “sex”, and “tissue”
• Top 3 suggestions
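The 80%/20% partition used in the workflow can be sketched as below. Whether the original split was random is not stated on the slides, so the shuffling, the seed, and the function name are assumptions for illustration.

```python
import random


def split_instances(instances, train_fraction=0.8, seed=0):
    """Shuffle the instances and split them into (training, test) lists.

    Assumption: a random split with a fixed seed for repeatability.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in identifiers for template instances.
training, test = split_instances(range(100))
print(len(training), len(test))
```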
20. Testing & Analysis
Compared suggested vs. expected metadata
Measure: Reciprocal Rank (RR). Appropriate for judging systems that return a ranking of suggestions when there is only one relevant result
$$\text{Reciprocal Rank (RR)} = \frac{1}{\text{position of the expected result in the ranking of suggestions}}$$
21. How is the RR calculated?
Expected | Suggested | K | Reciprocal Rank (RR)
asthma | 1) asthma, 2) lung cancer, 3) respiratory disease | 1 | 1/1
lymphoma | 1) myeloma, 2) lymphoma, 3) acute myeloid leukemia | 2 | 1/2
lung cancer | 1) respiratory disease, 2) asthma, 3) lung cancer | 3 | 1/3
Mean Reciprocal Rank (MRR) = (1/1 + 1/2 + 1/3) / 3 = 0.61
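The calculation above can be reproduced with a short sketch. The function names are ours, and scoring a missing expected value as 0 is a common convention that we assume here; the slides do not say how that case was handled.

```python
def reciprocal_rank(expected, suggestions):
    """Return 1 / (1-based position of expected in suggestions).

    Assumption: 0.0 when the expected value is not among the suggestions.
    """
    for position, suggestion in enumerate(suggestions, start=1):
        if suggestion == expected:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(cases):
    """Average the reciprocal rank over (expected, suggestions) pairs."""
    cases = list(cases)
    return sum(reciprocal_rank(e, s) for e, s in cases) / len(cases)


# The three cases from the slide.
cases = [
    ("asthma", ["asthma", "lung cancer", "respiratory disease"]),
    ("lymphoma", ["myeloma", "lymphoma", "acute myeloid leukemia"]),
    ("lung cancer", ["respiratory disease", "asthma", "lung cancer"]),
]

mrr = mean_reciprocal_rank(cases)
print(round(mrr, 2))  # 0.61
```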
22. Results
[Bar chart: Mean Reciprocal Rank (MRR), on a scale from 0 to 1, for the fields disease, tissue, and sex; series: Baseline and Metadata Recommender]
On average:
• Metadata Recommender = 0.77
• Baseline (majority vote) = 0.31
Better performance with respect to the baseline for:
• Fields with many different values
• Templates with many correlated fields
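One plausible reading of the majority-vote baseline is a recommender that ignores context and always suggests the most frequent values seen for a field in the training data. The sketch below follows that reading; the function name and data layout are assumptions, and the baseline actually used in the evaluation may differ in its details.

```python
from collections import Counter


def majority_vote_suggestions(training_instances, field, top_n=3):
    """Suggest the top_n most frequent training values for a field.

    Assumption: the baseline takes no account of other fields already filled in.
    """
    counts = Counter(
        instance[field]
        for instance in training_instances
        if instance.get(field) is not None
    )
    return [value for value, _ in counts.most_common(top_n)]


# Toy training data, invented for this example.
training_data = [
    {"sex": "male"},
    {"sex": "female"},
    {"sex": "male"},
    {"tissue": "lung"},
]

print(majority_vote_suggestions(training_data, "sex"))
```

Because such a baseline returns the same ranking whatever the user has already entered, it is at a disadvantage for fields with many different values and for templates whose fields are correlated, which matches the pattern reported above.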
23. Summary
• We developed a metadata recommendation system as part of an end-to-end system for metadata management called CEDAR
• Generates context-sensitive suggestions in real time
• Incorporates both ontology-based and free-text suggestions
24. Summary
Our approach makes it easier for scientists to generate high-quality metadata for experimental datasets
• So that the datasets can be found, interpreted, and reused
• Essential to ensure scientific reproducibility