Metadata Analyser: measuring metadata quality
Bruno Inácio, João D. Ferreira, and Francisco M. Couto
LaSIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal
PACBB, June 21-23, 2017
Porto, Portugal
Figure 1. Two pages (scan) from Galilei's Sidereus Nuncius (“The
Starry Messenger” or “The Herald of the Stars”), Venice, 1610.
Goodman A, Pepe A, Blocker AW, Borgman CL, et al. (2014) Ten Simple Rules for the Care and
Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003542
Galileo integrated
• the direct results of his observations of Jupiter
• with careful and clear descriptions of how they were performed
From “Big” Data to Knowledge
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc= "http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
<dc:description>
Gold collar. It was made from three circular sectioned and tapering gold bars
that are fused at the ends forming a penannular neck-ring.
</dc:description>
<dc:date>1250BC-800BC (circa)</dc:date>
<dc:location>
Sintra, Portugal
http://yboss.yahooapis.com/geo/placefinder?woeid=748874
</dc:location>
<dc:type>
Gold
http://purl.obolibrary.org/obo/CHEBI_30050
</dc:type>
</rdf:Description>
</rdf:RDF>
[Figure: ontology fragment with the concepts Metal, Precious, Coinage, Silver, Palladium, Gold, Platinum, and Copper, showing is-a relations and mappings]
Conventional Solution
proper data sharing rules
• So let’s create some Data-sharing Policies and some Compliance and Enforcement activities
Esperanto
• Created in 1887 as an easy-to-learn and politically neutral language
• But English provides a greater incentive
– [Chart: Websites Languages, March 2014]
Data-sharing policies
“Adherence to data-sharing policies is as
inconsistent as the policies themselves”
“351 papers covered by some data-sharing policy,
only 143 fully adhered to that policy” (~40%)
“is time-consuming to do properly, the reward
systems aren't there and neither is the stick”
“Of all the data that are made available, what
fraction is actually used by someone else?”
Steven Wiley in Nature, 2011
http://www.nature.com/news/2011/110914/full/news.2011.536.html
Human Factor
• “More often than scientists would like to
admit, they cannot even recover the data
associated with their own published works”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003542
Goals
1. to propose two measures of metadata quality
2. to implement a tool that is able to evaluate
these measures in a public repository
3. to show that these measures are valid and
significant in a real-world scientific repository
Measures of metadata quality
1. Term coverage
the proportion of annotations in the metadata file
that link to an ontology concept
2. Semantic specificity
the average specificity of those ontology
concepts
Term Coverage
• It is the ratio between
– the number of annotations that refer to ontology
concepts
– and the total number of annotations in the
metadata file
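The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not the Metadata Analyser implementation: it assumes that every child element of an `rdf:Description` counts as one annotation, and that an annotation "refers to an ontology concept" when its text contains a URI found in a known set of concept IRIs. The names `term_coverage`, `KNOWN_CONCEPTS`, and `METADATA` are hypothetical, and the sample metadata is a shortened version of the Sintra Collar record shown earlier.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened version of the Sintra Collar record from the earlier slide.
METADATA = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
    <dc:description>Gold collar.</dc:description>
    <dc:date>1250BC-800BC (circa)</dc:date>
    <dc:location>Sintra, Portugal
      http://yboss.yahooapis.com/geo/placefinder?woeid=748874</dc:location>
    <dc:type>Gold
      http://purl.obolibrary.org/obo/CHEBI_30050</dc:type>
  </rdf:Description>
</rdf:RDF>"""

# Assumption: stands in for the concepts held in the local ontology databases.
KNOWN_CONCEPTS = {"http://purl.obolibrary.org/obo/CHEBI_30050"}

URI_PATTERN = re.compile(r"https?://\S+")


def term_coverage(xml_text, known_concepts):
    """Return (annotations linked to a known concept) / (all annotations)."""
    root = ET.fromstring(xml_text)
    # Assumption: each child of an rdf:Description is one annotation.
    annotations = [
        element
        for description in root.findall(f"{{{RDF_NS}}}Description")
        for element in description
    ]
    if not annotations:
        return 0.0
    linked = sum(
        1
        for element in annotations
        if any(uri in known_concepts
               for uri in URI_PATTERN.findall(element.text or ""))
    )
    return linked / len(annotations)


coverage = term_coverage(METADATA, KNOWN_CONCEPTS)
print(coverage)  # 1 of 4 annotations links to a known concept -> 0.25
```

In this sample only `dc:type` points to an ontology concept; the place-finder URL in `dc:location` is a link, but not to a concept in the assumed local set, so it does not count.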
Semantic specificity
• A(t) is the number of ascendant concepts up
from t
• and D(t) is the average distance between t and
all its leaf descendants
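The two quantities can be computed on a toy ontology as follows. This is a sketch under stated assumptions: the small metal hierarchy is invented for illustration, distance is taken as the shortest is-a path, and D(t) is set to 0 for a term that is itself a leaf. The slide text does not reproduce how A(t) and D(t) are combined, so the ratio A(t) / (A(t) + D(t)) used in `specificity` is only a hypothetical normalisation, not necessarily the authors' formula.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its direct is-a parents.
PARENTS = {
    "metal": [],
    "precious metal": ["metal"],
    "coinage metal": ["metal"],
    "gold": ["precious metal", "coinage metal"],
    "silver": ["precious metal", "coinage metal"],
    "copper": ["coinage metal"],
}


def children_index(parents):
    """Invert the parent map into a term -> direct children map."""
    children = {term: [] for term in parents}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].append(child)
    return children


def ascendant_count(term, parents):
    """A(t): number of distinct ascendant concepts of the term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents[parent])
    return len(seen)


def avg_leaf_distance(term, parents):
    """D(t): average shortest distance from the term to its leaf descendants.

    Assumption: returns 0.0 when the term has no descendants.
    """
    children = children_index(parents)
    distance = {term: 0}
    queue = deque([term])
    leaf_distances = []
    while queue:
        node = queue.popleft()
        if node != term and not children[node]:
            leaf_distances.append(distance[node])
        for child in children[node]:
            if child not in distance:
                distance[child] = distance[node] + 1
                queue.append(child)
    if not leaf_distances:
        return 0.0
    return sum(leaf_distances) / len(leaf_distances)


def specificity(term, parents):
    """Hypothetical combination A / (A + D); an assumption, see lead-in."""
    a = ascendant_count(term, parents)
    d = avg_leaf_distance(term, parents)
    return a / (a + d) if (a + d) else 0.0


for name in ("metal", "precious metal", "gold"):
    print(name, ascendant_count(name, PARENTS),
          avg_leaf_distance(name, PARENTS), specificity(name, PARENTS))
```

Under this assumed normalisation the root scores 0 and leaves score 1, which matches the intuition that deeper, more precise terms are more specific.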
Metadata Analyser Architecture
1. An interface layer that interacts with the user by
requesting a metadata file, informing the user on the
analysis progress, and outputting the result
2. An application layer that analyses the metadata file
and evaluates the annotations found therein.
3. A data layer that holds the ontologies in local
databases
4. A web API layer that connects the interface layer to
the application layer, coded in commonly used web
technologies
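The separation of layers can be illustrated with a small sketch. Everything here is hypothetical: the class and function names (`OntologyStore`, `Analyser`, `handle_request`), the JSON payload shape, and the use of an in-memory set in place of the local databases are assumptions made for illustration, and do not describe the actual Metadata Analyser code.

```python
import json


class OntologyStore:
    """Data layer: stands in for the local ontology databases."""

    def __init__(self, concept_iris):
        self._concepts = set(concept_iris)

    def has_concept(self, iri):
        return iri in self._concepts


class Analyser:
    """Application layer: evaluates the annotations of one metadata file."""

    def __init__(self, store):
        self._store = store

    def analyse(self, annotations):
        linked = [a for a in annotations if self._store.has_concept(a)]
        coverage = len(linked) / len(annotations) if annotations else 0.0
        return {
            "annotations": len(annotations),
            "linked": len(linked),
            "term_coverage": coverage,
        }


def handle_request(body, analyser):
    """Web API layer: accepts a JSON request body, returns a JSON response."""
    payload = json.loads(body)
    return json.dumps(analyser.analyse(payload["annotations"]))


# Interface layer: builds the request and shows the result to the user.
store = OntologyStore(["http://purl.obolibrary.org/obo/CHEBI_30050"])
request_body = json.dumps({
    "annotations": [
        "http://purl.obolibrary.org/obo/CHEBI_30050",
        "Sintra, Portugal",
    ]
})
response = handle_request(request_body, Analyser(store))
print(response)
```

Keeping the ontology lookups behind the data layer means the application layer never needs to know whether concepts come from a local database or another source.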
Case Study: MetaboLights
• a database of metabolomics experiments
• developed by the EBI since 2012
• Evaluation
– computed the measures on all the resources
– evaluated a selection of resources manually
– compared metadata quality before and after a curation step by experts
Manual Evaluation
Lower coverage: not all ontologies used to annotate
the resources were included in the local database
pre- and post-curation analysis
Human Factor
Scientists:
1. may not know the ontologies that contain the
concepts they need
2. do not fully know the structure of the ontologies
in order to perform annotation with the
appropriate specific terms
3. lack the proper skills to carry out the annotation
process because of the technical difficulties
associated with this task
4. do not consider data sharing to be relevant
5. consider that the cost of ensuring proper
semantic integration outweighs the benefits
Conclusions
• an apparent correlation between specificity and
coverage
• weak term coverage (average of 0.25)
• the two proposed measures can effectively
measure the effort put into the semantic
annotation of digital resources
• Metadata Analyser
– a means for data providers to measure the quality of their metadata
– 10,000 times faster than the previous work
Acknowledgments
• The EBI team in charge of the development
and maintenance of MetaboLights, for their
support in this study.
Software:
https://github.com/lasigeBioTM/MetadataAnalyser

 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 

Metadata Analyser: measuring metadata quality

  • 1. Metadata Analyser: measuring metadata quality. Bruno Inácio, João D. Ferreira, and Francisco M. Couto. LaSIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal. PACBB, June 21-23, 2017, Porto, Portugal
  • 2. Figure 1. Two pages (scan) from Galilei's Sidereus Nuncius (“The Starry Messenger” or “The Herald of the Stars”), Venice, 1610. Goodman A, Pepe A, Blocker AW, Borgman CL, et al. (2014) Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Comput Biol 10(4): e1003542. doi:10.1371/journal.pcbi.1003542 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003542 Galileo integrated • the direct results of his observations of Jupiter • with careful and clear descriptions of how they were performed From “Big” Data to Knowledge
  • 3. <?xml version="1.0"?>
       <rdf:RDF
           xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
         <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Sintra_Collar">
           <dc:description>
             Gold collar. It was made from three circular sectioned and tapering gold bars
             that are fused at the ends forming a penannular neck-ring.
           </dc:description>
           <dc:date>1250BC-800BC (circa)</dc:date>
           <dc:location>
             Sintra, Portugal
             http://yboss.yahooapis.com/geo/placefinder?woeid=748874
           </dc:location>
           <dc:type>
             Gold
             http://purl.obolibrary.org/obo/CHEBI_30050
           </dc:type>
         </rdf:Description>
       </rdf:RDF>
  • 5. Conventional Solution: proper data-sharing rules • So let’s create some data-sharing policies and some compliance and enforcement activities
  • 6. Esperanto • Created in 1887 as an easy-to-learn and politically neutral language • But English provides a greater incentive – Website languages, March 2014
  • 7. Data-sharing policies “Adherence to data-sharing policies is as inconsistent as the policies themselves” “351 papers covered by some data-sharing policy, only 143 fully adhered to that policy” (~40%) “is time-consuming to do properly, the reward systems aren't there and neither is the stick” “Of all the data that are made available, what fraction is actually used by someone else?” Steven Wiley in Nature, 2011 http://www.nature.com/news/2011/110914/full/news.2011.536.html
  • 8. Human Factor • “More often than scientists would like to admit, they cannot even recover the data associated with their own published works” http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003542
  • 9. Goals 1. to propose two measures of metadata quality 2. to implement a tool that can evaluate these measures in a public repository 3. to show that these measures are valid and significant in a real-world scientific repository
  • 10. Measures of metadata quality 1. Term coverage: the proportion of annotations in the metadata file that link to an ontology concept 2. Semantic specificity: the average specificity of those ontology concepts
  • 11. Term Coverage • It is the ratio between – the number of annotations that refer to ontology concepts – and the total number of annotations in the metadata file
  • 12. Semantic specificity • A(t) is the number of ascendant concepts up from t • and D(t) is the average distance between t and all its leaf descendants
  • 13. Metadata Analyser Architecture 1. An interface layer that interacts with the user by requesting a metadata file, informing the user of the analysis progress, and outputting the result 2. An application layer that analyses the metadata file and evaluates the annotations found therein 3. A data layer that holds the ontologies in local databases 4. A web API layer that connects the interface layer to the application layer, coded in commonly used web technologies
  • 14. Case Study: MetaboLights • a database of metabolomics experiments • developed by the EBI since 2012 • Evaluation – the measures on all the resources – manually on a selection of resources – metadata quality before and after a curation step by experts
  • 18. Manual Evaluation Lower coverage: not all ontologies used to annotate the resources were included in the local database
  • 20. Human Factor • Scientists: 1. may not know the ontologies that contain the concepts they need 2. do not fully know the structure of the ontologies in order to annotate with the appropriate specific terms 3. lack the proper skills to carry out the annotation process because of the technical difficulties associated with this task 4. do not consider data sharing to be relevant 5. consider that the cost of ensuring proper semantic integration outweighs the benefits
  • 21. Conclusions • apparent correlation between specificity and coverage • a weak term coverage (average of 0.25) • two proposed measures can effectively measure the effort put into the semantic annotation of digital resources • Metadata Analyser – a means to measure the quality of their metadata – 10,000 times faster than the previous work
  • 22. Acknowledgments • The EBI team in charge of the development and maintenance of MetaboLights, for their support in this study. Software: https://github.com/lasigeBioTM/MetadataAnalyser
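
The two measures defined in slides 10 to 12 can be illustrated with a minimal Python sketch. It is not the Metadata Analyser implementation: the toy is-a hierarchy, the function names, and the choice of shortest-path distance for D(t) are all assumptions made for illustration. The transcript does not include the formula that combines A(t) and D(t) into a specificity score, so the sketch only computes the two ingredients and leaves the combination out.

```python
from collections import deque

# Hypothetical toy is-a hierarchy (child -> parents), loosely inspired by the
# metal example in the slides. It is NOT taken from any real ontology.
PARENTS = {
    "metal": [],
    "precious metal": ["metal"],
    "coinage metal": ["metal"],
    "gold": ["precious metal", "coinage metal"],
    "silver": ["precious metal", "coinage metal"],
    "platinum": ["precious metal"],
    "copper": ["coinage metal"],
}


def term_coverage(annotations, ontology_terms):
    """Ratio of annotations that refer to an ontology concept to all annotations.

    Assumption: an annotation "refers to a concept" when its value is a known
    concept identifier. An empty metadata file is given a coverage of 0.0.
    """
    if not annotations:
        return 0.0
    linked = sum(1 for a in annotations if a in ontology_terms)
    return linked / len(annotations)


def ancestors(term):
    """Set of all ascendant concepts of `term`; its size is A(t)."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS[parent])
    return seen


def children_index():
    """Invert PARENTS into a parent -> children mapping."""
    index = {t: [] for t in PARENTS}
    for child, parents in PARENTS.items():
        for parent in parents:
            index[parent].append(child)
    return index


def avg_leaf_distance(term):
    """D(t): average distance from `term` to each of its leaf descendants.

    Assumptions: distance is the shortest is-a path length, and a leaf
    (which has no descendants) gets D(t) = 0.0.
    """
    children = children_index()
    dist = {term: 0}
    queue = deque([term])
    leaf_dists = []
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in dist:  # first visit in BFS is the shortest path
                dist[child] = dist[node] + 1
                queue.append(child)
                if not children[child]:
                    leaf_dists.append(dist[child])
    return sum(leaf_dists) / len(leaf_dists) if leaf_dists else 0.0


if __name__ == "__main__":
    # Four annotations, only one of which maps to a concept in the toy ontology.
    annotations = ["gold", "Sintra, Portugal", "1250BC-800BC (circa)", "Gold collar"]
    print(term_coverage(annotations, set(PARENTS)))  # 0.25
    print(len(ancestors("gold")))                    # 3
    print(avg_leaf_distance("metal"))                # 2.0
```

In this toy setting a specific term such as "gold" has many ascendants and no leaf descendants, while a general term such as "metal" has no ascendants and distant leaves, which matches the intuition behind the specificity measure.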