Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations (AMIA 2017 Conference)

S100
Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar, M., Willrett, D., Egyedi, A. L., Gevaert, O., Graybeal, J., Musen, M. A.
Stanford University
Fast and Accurate Metadata Authoring Using Ontology-Based Recommendations

What is metadata?
AMIA 2017 | amia.org
• Data that describe data
• Crucial for:
• Finding experimental datasets online
• Understanding how the experiments were performed
• Reusing the data to perform new analyses

age
Age
AGE
`Age
age (after birth)
age (in years)
age (y)
age (year)
age (years)
Age (years)
Age (Years)
age (yr)
age (yr-old)
age (yrs)
Age (yrs)
age [y]
age [year]
age [years]
age in years
age of patient
Age of patient
age of subjects
age(years)
Age(years)
Age(yrs.)
Age, year
age, years
age, yrs
age.year
age_years
Poor metadata

An analysis of metadata from NCBI’s BioSample
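• 73% of “Boolean” values are not valid Booleans
• e.g., nonsmoker, former-smoker
• 26% of “integer” values cannot be parsed as integers
• e.g., JM52, UVPgt59.4, pig
• 68% of values meant to be ontology terms are not valid ontology terms
• e.g., presumed normal, wild_type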
Gonçalves, R. S. et al. (2017). Metadata in the BioSample Online Repository are Impaired by Numerous
Anomalies. SemSci 2017 Workshop, co-located with ISWC 2017. Vienna, Austria.
Poor metadata
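
The kind of type check behind figures like these can be sketched in a few lines. This is only an illustration: the accepted Boolean spellings, the helper names, and the sample lists (which mix the slide's examples with two valid values added for contrast) are assumptions, not the procedure used by Gonçalves et al.

```python
# Hypothetical validators; the accepted spellings are an assumption.
ACCEPTED_BOOLEANS = {"true", "false", "yes", "no", "0", "1"}


def is_boolean(value: str) -> bool:
    """Return True if the value reads as a Boolean under the assumed spellings."""
    return value.strip().lower() in ACCEPTED_BOOLEANS


def is_integer(value: str) -> bool:
    """Return True if the value can be parsed by Python's int()."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


def invalid_fraction(values, validator) -> float:
    """Fraction of values rejected by the validator (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if not validator(v)) / len(values)


# Slide examples plus one or two valid values added for illustration.
boolean_values = ["nonsmoker", "former-smoker", "true", "false"]
integer_values = ["JM52", "UVPgt59.4", "pig", "42"]

boolean_invalid = invalid_fraction(boolean_values, is_boolean)
integer_invalid = invalid_fraction(integer_values, is_integer)
```

Running the same check over every value of a typed field would give the share of anomalous entries for that field.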

Metadata authoring is hard

• A computational platform for metadata management
• Goal: Overcome the impediments to creating high-quality metadata
Metadata template

[Figure: CEDAR workflow in three stages]
• Design template: Template Designer, used by template authors (e.g., standards committees)
• Fill in metadata: Metadata Editor, used by metadata authors (e.g., scientists)
• Submit metadata: Metadata Repository (templates and metadata), then LINCS and public databases
[Screenshots: template at https://cedar.metadatacenter.org/templates/edit/https://repo.metadatacenter.org/templates/ab105771-564e-42a1-9be4-5a63891… and instance at https://cedar.metadatacenter.org/instances/edit/https://repo.metadatacenter.org/template-instances/d4f1059e-8e27-4166-902f-…]
[Sample form values: A sample study, Acute stress disorder, Stanford University, John Doe, Longitudinal]

We developed a metadata recommendation system
[Figure: the CEDAR workflow diagram (design template, fill in metadata, submit metadata) shown again, with the sample form values: A sample study, Stanford University, John Doe, Longitudinal]

Metadata recommendation system
[Figure: recommendation loop between the Metadata Editor, the Metadata Repository, and the Metadata Recommender]
(1) The Metadata Editor stores metadata in the Metadata Repository
(2) The Metadata Recommender analyzes existing metadata
(3) The Metadata Recommender generates suggestions for the Metadata Editor
[Sample form values: A sample study, Stanford University, John Doe, Longitudinal]
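
The store, analyze, and suggest loop can be sketched as below. This is a minimal sketch under stated assumptions: it ranks candidate values by how often they co-occur with the fields already filled in, which is one simple way to obtain context-sensitive suggestions and not necessarily the scoring CEDAR uses. The class name, method names, and sample instances are hypothetical.

```python
from collections import Counter


class SketchRecommender:
    """Hypothetical in-memory stand-in for a repository plus recommender."""

    def __init__(self):
        self.instances = []  # previously stored metadata instances

    def store(self, instance: dict) -> None:
        """Step 1: keep a copy of a submitted metadata instance."""
        self.instances.append(dict(instance))

    def suggest(self, target_field: str, context=None, top_k: int = 3) -> list:
        """Steps 2 and 3: count values of target_field among stored instances
        that agree with the already-filled fields, then return the top_k."""
        context = context or {}
        counts = Counter()
        for inst in self.instances:
            value = inst.get(target_field)
            if value is None:
                continue
            if all(inst.get(f) == v for f, v in context.items()):
                counts[value] += 1
        return [value for value, _ in counts.most_common(top_k)]


rec = SketchRecommender()
rec.store({"tissue": "lung", "disease": "asthma"})
rec.store({"tissue": "lung", "disease": "lung cancer"})
rec.store({"tissue": "lung", "disease": "lung cancer"})
rec.store({"tissue": "blood", "disease": "lymphoma"})

lung_suggestions = rec.suggest("disease", {"tissue": "lung"})
blood_suggestions = rec.suggest("disease", {"tissue": "blood"})
```

With "tissue" already set to "lung", the suggestions for "disease" are drawn only from lung samples, so "lymphoma" is not offered.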

Filling in a CEDAR template

Evaluation workflow
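[Figure: four-step evaluation pipeline]
(1) Preprocessing and ingestion: gene expression metadata from BioSample are converted into BioSample template instances (≈35K) using the CEDAR BioSample template
(2) Semantic annotation: produces annotated BioSample template instances (≈35K)
(3) Training: the training datasets (80% of the plain and of the annotated instances) are loaded into the CEDAR Metadata Repository for the Metadata Recommender
(4) Testing & analysis: the test datasets (the remaining 20%) are run through the Metadata Recommender to produce the evaluation results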

Evaluation workflow
• For “disease”, “sex”, and “tissue”
• Top 3 suggestions
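
The 80%/20% partition into training and test datasets can be sketched as a seeded shuffle followed by a cut. The helper name, the seed, and the toy instances are assumptions; the slides do not say how the split was drawn.

```python
import random


def split_instances(instances, train_fraction=0.8, seed=0):
    """Shuffle a copy of the instances and cut it into (training, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the ≈35K template instances.
instances = [{"id": i} for i in range(10)]
train, test = split_instances(instances)
```

The training part would be stored for the recommender to learn from, while each test instance supplies the expected values for the evaluated fields.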

Testing & Analysis
Compared suggested vs. expected metadata
Measure: Reciprocal Rank (RR). Appropriate for judging systems that return a ranking of suggestions when there is only one relevant result.
\[ \mathrm{RR} = \frac{1}{\text{position of the expected result in the ranking of suggestions}} \]

How is the RR calculated?
Expected       Suggested                     K (position)   Reciprocal Rank (RR)
asthma         1) asthma                     1              1/1
               2) lung cancer
               3) respiratory disease
lymphoma       1) myeloma                    2              1/2
               2) lymphoma
               3) acute myeloid leukemia
lung cancer    1) respiratory disease        3              1/3
               2) asthma
               3) lung cancer
Mean Reciprocal Rank (MRR) = (1/1 + 1/2 + 1/3) / 3 = 0.61
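
The worked example above can be reproduced with a short sketch. The function names are hypothetical, and returning 0 when the expected value is missing from the suggestions is a common convention assumed here, since the slides do not cover that case.

```python
def reciprocal_rank(expected, suggestions):
    """1 / position of the expected value in the ranked suggestions.

    Returns 0.0 if the expected value is absent (assumed convention).
    """
    for position, suggestion in enumerate(suggestions, start=1):
        if suggestion == expected:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(cases):
    """Average the reciprocal rank over (expected, suggestions) pairs."""
    return sum(reciprocal_rank(e, s) for e, s in cases) / len(cases)


cases = [
    ("asthma", ["asthma", "lung cancer", "respiratory disease"]),
    ("lymphoma", ["myeloma", "lymphoma", "acute myeloid leukemia"]),
    ("lung cancer", ["respiratory disease", "asthma", "lung cancer"]),
]

mrr = mean_reciprocal_rank(cases)
print(round(mrr, 2))  # 0.61
```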

Results
[Figure: bar chart of Mean Reciprocal Rank (MRR), on a 0 to 1 scale, for the fields disease, tissue, and sex, comparing the Baseline with the Metadata Recommender]
On average:
• Metadata Recommender = 0.77
• Baseline (majority vote) = 0.31
Better performance with respect to the baseline for:
• Fields with many different values
• Templates with many correlated fields
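
A majority-vote baseline can be sketched as follows, assuming it ranks the values of a field by their overall frequency in the training data and ignores the other fields. The slides do not spell out the baseline, so the function name and sample values are illustrative only.

```python
from collections import Counter


def majority_vote_suggestions(training_values, top_k=3):
    """Rank a field's values by overall frequency, without any context."""
    return [value for value, _ in Counter(training_values).most_common(top_k)]


# Hypothetical training values for the "sex" field.
sex_values = ["male", "female", "female", "male", "male"]
baseline = majority_vote_suggestions(sex_values)
```

Such a baseline always proposes the same ranking for a field, which is consistent with it doing worse on fields with many different values and on templates whose fields are correlated.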

Summary
• We developed a metadata recommendation system as part of an end-to-end system for metadata management called CEDAR
• Generates context-sensitive suggestions in real time
• Incorporates both ontology-based and free-text suggestions

Summary
Our approach makes it easier for scientists to generate high-quality metadata for experimental datasets
• So that the datasets can be found, interpreted, and reused
• Essential to ensure scientific reproducibility

facebook.com/metadatacenter
@metadatacenter
http://cedar.metadatacenter.org
Channel: Metadata Center
github.com/metadatacenter
