S100
Martínez-Romero, M., O’Connor, M. J., Shankar, R., Panahiazar,
M., Willrett, D., Egyedi, A. L., Gevaert, O., Graybeal, J., Musen, M. A.
Stanford University
Fast and Accurate Metadata Authoring
Using Ontology-Based Recommendations
What is metadata?
AMIA 2017 | amia.org
• Data that describe data
• Crucial for:
• Finding experimental datasets online
• Understanding how the experiments were performed
• Reusing the data to perform new analyses
age
Age
AGE
`Age
age (after birth)
age (in years)
age (y)
age (year)
age (years)
Age (years)
Age (Years)
age (yr)
age (yr-old)
age (yrs)
Age (yrs)
age [y]
age [year]
age [years]
age in years
age of patient
Age of patient
age of subjects
age(years)
Age(years)
Age(yrs.)
Age, year
age, years
age, yrs
age.year
age_years
Poor metadata
An analysis of metadata from NCBI’s BioSample
• 73% of “Boolean” values are not valid Booleans
• e.g., nonsmoker, former-smoker
• 26% of “integer” values are not valid integers
• e.g., JM52, UVPgt59.4, pig
• 68% of ontology terms are not valid ontology terms
• e.g., presumed normal, wild_type
Gonçalves, R. S. et al. (2017). Metadata in the BioSample Online Repository are Impaired by Numerous
Anomalies. SemSci 2017 Workshop, co-located with ISWC 2017. Vienna, Austria.
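The kind of check behind these figures can be illustrated with a minimal sketch. It is not the method of the cited analysis: the accepted Boolean literals and the function names are assumptions made for illustration, and the values "true" and "42" are added as hypothetical valid examples next to the anomalous values from the slide.

```python
def is_boolean(value: str) -> bool:
    """Assumption: only the literals 'true' and 'false' count as Boolean."""
    return value.strip().lower() in {"true", "false"}


def is_integer(value: str) -> bool:
    """A value counts as an integer if Python's int() can parse it."""
    try:
        int(value.strip())
        return True
    except ValueError:
        return False


# Anomalous values from the slide, plus one hypothetical valid value each.
boolean_examples = ["nonsmoker", "former-smoker", "true"]
integer_examples = ["JM52", "UVPgt59.4", "pig", "42"]

invalid_booleans = [v for v in boolean_examples if not is_boolean(v)]
invalid_integers = [v for v in integer_examples if not is_integer(v)]

print(invalid_booleans)
print(invalid_integers)
```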
Poor metadata
Metadata authoring is hard
• CEDAR: a computational platform for metadata management
• Goal: Overcome the impediments to creating high-quality metadata
Metadata template
[Diagram: three stages, DESIGN TEMPLATE → FILL IN METADATA → SUBMIT METADATA]
• Template Designer: used by template authors (e.g., standards committees)
• Metadata Editor: used by metadata authors (e.g., scientists)
• Metadata Repository: stores templates and metadata
• Also shown: LINCS, public databases
Template Designer URL: https://cedar.metadatacenter.org/templates/edit/https://repo.metadatacenter.org/templates/ab105771-564e-42a1-9be4-5a63891…
Metadata Editor URL: https://cedar.metadatacenter.org/instances/edit/https://repo.metadatacenter.org/template-instances/d4f1059e-8e27-4166-902f-…
Example values in the Metadata Editor: A sample study; Acute stress disorder; Stanford University; John Doe; Longitudinal
We developed a metadata recommendation system
Metadata recommendation system
[Diagram: Metadata Editor, Metadata Repository, and Metadata Recommender, connected by three numbered steps]
• Store metadata: metadata entered in the Metadata Editor are stored in the Metadata Repository
• Analyze existing metadata: the Metadata Recommender analyzes the stored metadata
• Generate suggestions: the Metadata Recommender returns suggestions to the Metadata Editor
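One simple way to picture the "analyze existing metadata, generate suggestions" loop is a conditional frequency count over stored instances. The sketch below is only an illustration under that assumption and is not CEDAR's actual algorithm; the function name, the field names, and the stored instances are all hypothetical.

```python
from collections import Counter


def context_sensitive_suggestions(instances, target_field, context, top_n=3):
    """Rank values of target_field by how often they occur in stored
    instances that agree with the fields already filled in (context)."""
    matching = [
        inst for inst in instances
        if target_field in inst
        and all(inst.get(field) == value for field, value in context.items())
    ]
    counts = Counter(inst[target_field] for inst in matching)
    return [value for value, _ in counts.most_common(top_n)]


# Hypothetical metadata already stored in the repository.
stored = [
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "lung", "disease": "lung cancer"},
    {"tissue": "lung", "disease": "asthma"},
    {"tissue": "blood", "disease": "lymphoma"},
    {"tissue": "blood", "disease": "lymphoma"},
    {"tissue": "blood", "disease": "myeloma"},
]

# The author has already entered tissue = lung; suggest values for disease.
lung_suggestions = context_sensitive_suggestions(stored, "disease", {"tissue": "lung"})
print(lung_suggestions)
```

With the tissue already set to "lung", the ranking for "disease" changes accordingly, which is the sense in which such suggestions depend on context.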
Filling in a CEDAR template
Evaluation workflow
[Diagram: evaluation workflow in four steps]
(1) Preprocessing and Ingestion: gene expression metadata are converted into BioSample template instances (≈35K) using the CEDAR BioSample template
(2) Semantic annotation: produces annotated BioSample template instances (≈35K)
(3) Training: each set of instances is split into a training dataset (80%) and a test dataset (20%); the training datasets are stored in the CEDAR Metadata Repository for the Metadata Recommender
(4) Testing & Analysis: the Metadata Recommender is run on the test datasets to produce the evaluation results
• For “disease”, “sex”, and “tissue”
• Top 3 suggestions
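The 80%/20% split used in the workflow can be sketched as follows. The slides do not say how instances were assigned to each set, so the random shuffle, the seed, and the function name are assumptions, and the ten instances are placeholders.

```python
import random


def split_instances(instances, train_fraction=0.8, seed=0):
    """Shuffle the instances and split them into training and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder instances standing in for BioSample template instances.
instances = [{"id": i} for i in range(10)]
train, test = split_instances(instances)

print(len(train), len(test))
```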
Testing & Analysis
Compared suggested vs. expected metadata
Measure: Reciprocal Rank (RR). Appropriate for judging
systems that return a ranking of suggestions when there is only
one relevant result
$$\text{Reciprocal Rank (RR)} = \frac{1}{K}$$
where K is the position of the expected result in the ranking of suggestions
How is the RR calculated?
Expected      Suggested                                                   K   Reciprocal Rank (RR)
asthma        1) asthma  2) lung cancer  3) respiratory disease           1   1/1
lymphoma      1) myeloma  2) lymphoma  3) acute myeloid leukemia          2   1/2
lung cancer   1) respiratory disease  2) asthma  3) lung cancer           3   1/3
Mean Reciprocal Rank (MRR) = (1/1 + 1/2 + 1/3) / 3 = 0.61
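The calculation above can be reproduced with a short sketch. The function name is hypothetical, and returning 0 when the expected value is missing from the suggestions is a common convention assumed here, not something stated on the slides.

```python
def reciprocal_rank(expected, suggestions):
    """Return 1/K, where K is the 1-based position of the expected value
    in the ranked suggestions (assumption: 0.0 if it is absent)."""
    for position, suggestion in enumerate(suggestions, start=1):
        if suggestion == expected:
            return 1.0 / position
    return 0.0


# The three cases from the table above: (expected, ranked suggestions).
cases = [
    ("asthma", ["asthma", "lung cancer", "respiratory disease"]),
    ("lymphoma", ["myeloma", "lymphoma", "acute myeloid leukemia"]),
    ("lung cancer", ["respiratory disease", "asthma", "lung cancer"]),
]

rrs = [reciprocal_rank(expected, suggested) for expected, suggested in cases]
mrr = sum(rrs) / len(rrs)

print(round(mrr, 2))  # 0.61
```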
Results
[Bar chart: Mean Reciprocal Rank (MRR), on a scale from 0 to 1, for the fields disease, tissue, and sex; bars compare the Baseline with the Metadata Recommender]
On average:
• Metadata Recommender = 0.77
• Baseline (majority vote) = 0.31
Better performance with respect to the baseline for:
• Fields with many different values
• Templates with many correlated fields
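The slides name the baseline only as "majority vote". One plausible reading, sketched below as an assumption, is that it always suggests the most frequent values of a field in the training data, regardless of what else has been filled in. The function name and the training values are hypothetical.

```python
from collections import Counter


def majority_vote_suggestions(training_values, top_n=3):
    """Suggest the top_n most frequent values seen for a field in training."""
    counts = Counter(training_values)
    return [value for value, _ in counts.most_common(top_n)]


# Hypothetical training values for the "disease" field.
training_diseases = [
    "asthma", "lung cancer", "asthma", "lymphoma", "asthma", "lung cancer",
]

baseline = majority_vote_suggestions(training_diseases)
print(baseline)
```

Under this reading the baseline returns the same ranking for every instance, which would explain why it falls behind on fields with many different values.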
Summary
• We developed a metadata recommendation system as part of an end-to-end system for metadata management called CEDAR
• Generates context-sensitive suggestions in real time
• Incorporates both ontology-based and free-text suggestions
Summary
Our approach makes it easier for scientists to generate high-quality metadata for experimental datasets
• So that the datasets can be found, interpreted, and reused
• Essential to ensure scientific reproducibility
facebook.com/metadatacenter
@metadatacenter
http://cedar.metadatacenter.org
Channel: Metadata Center
github.com/metadatacenter
