EFO tools – the good, the great, and the evil Tomasz Adamusiak MD PhD
Huge ontology developed by a tiny team
We have means to assign blame when things go wrong (definition_editor)
We need richness and consistency for EFO-based query expansion
New terms come from GXA and external users (diagram labels: GXA, Zooma, OLS, BioPortal, similarity_match.pl, OWL::Simple::Parser, MeSH::Parser::ASCII)
Xrefs are acquired by lexical cross-match to other ontologies (diagram labels: similarity_match.pl, OWL::Simple::Parser, MeSH::Parser::ASCII)
Definitions and synonyms are pulled in from external ontologies via NCBO BioPortal, with provenance (diagram labels: BioPortal, metadata, xrefs, BioportalImporter)
Regression testing is essential as these are massive updates
We need better concept recognition because clean_ontology_terms.pl is evil
We need fuzziness, because input data is extremely dirty
There are different levels of fuzziness (slide labels: similarity_match.pl, metaphone and double metaphone, Levenshtein distance, n-grams, clean_ontology_terms.pl)
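As an illustration of one of the fuzziness levels named on this slide, here is a minimal Python sketch of Levenshtein distance. It is not taken from similarity_match.pl (which is a Perl script); the function names `levenshtein` and `levenshtein_similarity` are hypothetical, and the normalisation to a 0..1 score is an assumption made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Hypothetical normalisation: 1.0 for identical strings, 0.0 when
    every character has to change. An assumption, not the slide's metric."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Edit distance copes well with typos, but it penalises reordered words heavily, which is one reason a second, order-tolerant measure is useful for dirty input.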
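N-grams are a simple and relatively unknown method of approximate string matching
N-grams are extremely effective in practice (example: reference string "The quick brown fox" compared with A. "brown quick The fox" and B. "The quiet swine flu"; similarity scores shown on the slide: 18%, 90%, 19%, 40%)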
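A minimal Python sketch of n-gram similarity, to make the idea on this slide concrete. The slides do not say which n or which scoring formula was used, so character trigrams and the Dice coefficient are assumptions here, and the names `char_ngrams` and `ngram_similarity` are hypothetical. The scores this sketch produces are therefore not expected to reproduce the percentages on the slide.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Too short to form an n-gram: treat the whole string as one gram.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the two n-gram sets (assumed scoring, 0..1)."""
    grams_a = char_ngrams(a, n)
    grams_b = char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are trivially identical
    shared = len(grams_a & grams_b)
    return 2.0 * shared / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    reference = "The quick brown fox"
    for candidate in ("brown quick The fox", "The quiet swine flu"):
        print(candidate, round(ngram_similarity(reference, candidate), 2))
```

Because the score depends only on which short substrings the two strings share, reordering words changes just the few n-grams that span word boundaries. Under these assumptions the reordered string A scores higher against the reference than the unrelated string B.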
The King is dead. Long live the Queen. 
OntoCAT is a great success and generated a lot of interest within the community
Natalja & Misha hit the mother lode
Which diseases affect heart components? Kurbatova N et al. Bioinformatics 2011;27:2468-2470
Acknowledgments
Morris A. Swertz’s group at the Genomics Coordination Center (GCC), University of Groningen
K Joeri van der Velde, Despoina Antonakaki, Dasha Zhernakova, James Malone, Helen Parkinson, Emma Hastings, Niran Abeygunawardena, Ele Holloway, Tim Rayner
Zooma: Tony Burdett
Bioconductor/R package: Natalja Kurbatova, Pavel Kurnosov, Misha Kapushesky
This work was supported by the European Community's Seventh Framework Programmes GEN2PHEN [grant number 200754], SLING [grant number 226073], and SYBARIS [grant number 242220], the European Molecular Biology Laboratory, the Netherlands Organisation for Scientific Research [NWO/Rubicon grant number 825.09.008], and the Netherlands Bioinformatics Centre [BioAssist/Biobanking platform and BioRange grant SP1.2.3]
OntoCAT logo courtesy of Eamonn Maguire
Special thanks go to NCBO BioPortal and EBI OLS support teams for all the comprehensive help they provide


EFO tools - the good, the great, and the evil

  • 1. EFO tools – the good, the great, and the evil Tomasz Adamusiak MD PhD
  • 2. Huge ontology developed by a tiny team
  • 3. We have means to assign blame when things go wrong (definition_editor)
  • 4. We need richness and consistency for EFO based query expansion
  • 5. New terms come from GXA and external users GXA Zooma OLS BioPortal similarity_match.pl OWL::Simple::Parser MeSH::Parser::ASCII
  • 6. Xrefs are acquired by lexical cross-match to other ontologies similarity_match.pl OWL::Simple::Parser MeSH::Parser::ASCII
  • 7. Definitions and synonyms are pulled in from external ontologies via NCBO BioPortal + provenance BioPortal metadata xrefs BioportalImporter
  • 8. Regression testing is essential as these are massive updates
  • 9. We need better concept recognition because clean_ontology_terms.pl is evil
  • 10. We need fuzziness, because input data is extremely dirty
  • 11. There are different levels of fuzziness similarity_match.pl metaphone & double metaphone Levenshtein distance n-grams clean_ontology_terms.pl
  • 12. N-grams is a simple and relatively unknown method of string approximation
  • 13. N-grams are extremely effective in practice The quick brown fox A. brown quick The fox (Levenshtein 18%, n-grams 90%) B. The quiet swine flu (Levenshtein 19%, n-grams 40%)
  • 14. The King is dead. Long live the Queen. 
  • 15. OntoCAT is a great success and generated a lot of interest within the community
  • 16. Natalja & Misha hit the mother lode
  • 17. Which diseases affect heart components? Kurbatova N et al. Bioinformatics 2011;27:2468-2470
  • 18. Acknowledgments Morris A. Swertz’s group at the Genomics Coordination Center (GCC), University of Groningen: K Joeri van der Velde, Despoina Antonakaki, Dasha Zhernakova. James Malone, Helen Parkinson, Emma Hastings, Niran Abeygunawardena, Ele Holloway, Tim Rayner. Zooma: Tony Burdett. Bioconductor/R package: Natalja Kurbatova, Pavel Kurnosov, Misha Kapushesky. This work was supported by the European Community's Seventh Framework Programmes GEN2PHEN [grant number 200754], SLING [grant number 226073], and SYBARIS [grant number 242220], the European Molecular Biology Laboratory, the Netherlands Organisation for Scientific Research [NWO/Rubicon grant number 825.09.008], and the Netherlands Bioinformatics Centre [BioAssist/Biobanking platform and BioRange grant SP1.2.3]. OntoCAT logo courtesy of Eamonn Maguire. Special thanks go to NCBO BioPortal and EBI OLS support teams for all the comprehensive help they provide.

Editor's Notes

  1. Experimental Factor Ontology is a great application ontology, hugely popular among internal and external collaborators and featured among the top 10 most accessed ontologies within NCBO BioPortal, which provides access to hundreds of different ontology resources. It is a pleasure to be involved in this project.
  2. I joined the EFO team around January 2008, working in parallel on GEN2PHEN, into which some of this work was fed back. My first task was designing and implementing a workflow for pulling in metadata (synonyms and definitions) for cross-referenced terms in external ontologies. We now have nearly 5,000 classes and 20,000 synonyms, and growth continues steadily.
  3. Venn diagram showing who edited or added which class. Where the sets overlap, the same class was touched by more than one person. Three people interact with the ontology directly: Helen, James and me. Ele and Jie would submit large term requests, so they added those classes indirectly through one of us.
  4. This is how we leverage the rich metadata within the ontology. Here is an example of querying ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) for CML and getting all experiments also annotated with chronic myeloid leukemia and chronic myelogenous leukemia. Querying for leukemia or blood cancer would also return these results. Any inconsistency in the ontology would degrade this outcome.
  5. Here’s a typical workflow. Annotations unmapped to EFO in the Gene Expression Atlas (http://www.ebi.ac.uk/gxa/) are discovered by Zooma (zooma.sf.net). Zooma first checks whether a mapping already exists within the Atlas and, if not, tries to map the annotation to EFO or to other ontologies in OLS and BioPortal via OntoCAT. The output is then fed into the similarity_match.pl script to double-check that no similar terms are already in EFO (Zooma performs only exact matching), and the vetted terms are finally added to EFO via James’ tab_to_owl script or manually. Another source of new terms is requests from external users. They usually supply a flat list of terms they would like to see in the ontology. These are mapped via similarity_match.pl to check whether they are already in EFO, and then added. similarity_match.pl has custom dedicated dependencies for parsing OWL ontologies and MeSH.
  6. Before metadata from external resources can be imported into EFO, we need to add appropriate xrefs. These are stored in a dedicated annotation ‘definition_citation’ on the mapped term within EFO. The xrefs are discovered by using similarity_match.pl to align other ontologies (e.g. MeSH, OMIM, NCI Thesaurus, Brenda, Cell Type, etc.) lexically to EFO. Note that other tools in this domain rely on information content to align ontologies. As far as I know they use exact matching only, so our approach could in fact be more efficient, and in my experience the information content approach does not add much value to the alignment.
  7. Once we have the xrefs in, we can use a separate application, BioportalImporter, which follows all the xrefs to the respective external terms via BioPortal and imports all the missing synonyms and definitions into EFO, recording the source in a dedicated ‘bioportal_provenance’ annotation. With OWL2 it would also be possible to annotate the annotation directly.
  8. Part of the BioportalImporter code base is consistency checking, which performs 13 different tests once the import is completed. Most importantly, it reports any changes in external resources by cross-referencing provenance information between two versions of the import, and it alerts on potentially duplicated terms by looking for metadata shared between two distinct terms within EFO. BioportalImporter is not in the public domain as it is tied quite heavily to EFO specifics, but most of the ontology handling code is actually in OntoCAT. Overview of the tests: malformed EFO URIs; changed ontology annotations; changed classes; obsoleted classes; renamed classes; duplicated xrefs; duplicated synonyms or labels; duplicated xrefs same as URI; local EFO URIs on external classes; changed features; changed external classes; circular references; non-English characters in annotations.
  9. clean_ontology_terms.pl relies on the metaphone and double metaphone algorithms. Metaphone was developed by Lawrence Philips as a response to deficiencies in the Soundex algorithm. It uses a larger set of rules for English pronunciation. The aim of Metaphone is to match words or names that are pronounced similarly, under a criterion of similarity that ignores non-initial vowels and treats voiced and unvoiced versions of consonants as the same. Its latest version, Metaphone 3, achieves an unparalleled level of accuracy in producing correct lookup keys for English words, non-English words familiar to English speakers, and names commonly found in the United States, within the criterion of similarity defined above, but it is not designed to match words which are clearly pronounced differently. Recently published: Anatomy ontologies and potential users: bridging the gap, by Ravensara S Travillian, Tomasz Adamusiak, Tony Burdett, Michael Gruenberger, John Hancock, Ann-Marie Mallon, James Malone, Paul Schofield and Helen Parkinson. While the original aim of the article was to show how difficult it is to align two anatomy ontologies, FMA and Uberon, the other conclusion that can be reached is that metaphone algorithms are inapplicable to this particular use case. Most importantly, clean_ontology_terms.pl performed only marginally better than Zooma doing exact matching, with an enormous hit to precision (~0.07), because for lack of better matches the script would present all the phrases that merely start with the same word. This is a side effect of double metaphone being misapplied to a whole phrase rather than to individual words, and differs from the behaviour of classic metaphone.
  10. Our input data rarely differs in spelling, such as British tumour versus American tumor; more often it differs in grammatical number (cell vs. cells), digits, typos, and word order in similar phrases. Here the left column shows examples of unmapped annotations from the Atlas, and the right-hand column shows the existing EFO terms that we would like to map them to semi-automatically. The ontology is too big to handle manually, and it is no longer possible to remember whether a particular term has already been added, which is why we need to automate this.
  11. First of all, clean_ontology_terms.pl is not that fuzzy at all. Tim Rayner, the original developer of clean_ontology_terms.pl, had already considered a fuzzier approach, and there is a comment in the code suggesting the use of Levenshtein distance. Rather than extending the script further, I rewrote it from scratch as similarity_match.pl. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. The algorithm is an example of bottom-up dynamic programming, a method for solving complex problems by breaking them down into simpler subproblems. Similar approaches have been studied extensively in DNA sequence alignment, and the edit distance approach is further generalised by the local and global alignment algorithms Smith–Waterman and Needleman–Wunsch, but they do not offer much improvement for transpositions, i.e. different ordering of words in a phrase. And this is where n-grams excel.
  12. An n-gram is a fragment of length n taken from a given sequence. The idea can be traced to Claude Shannon's work in information theory in the late 1940s, but it was Gravano et al. who first suggested it for string querying in database applications.
  13. N-grams work particularly well for transpositions. This surprisingly simple and easy-to-implement approach allows some powerful fuzzy matching. The general idea is that you split the two strings in question into all possible 2-character fragments (2-grams) and treat the number of n-grams shared between the two strings as their similarity metric. This is easily normalised by dividing the shared number by the total number of n-grams in the longer string. Here we have three strings, each 19 characters long. Two things are surprising about using Levenshtein distance in this case: both A and B score quite low on similarity to the template, and the completely different string B is actually the more similar one. N-grams, on the other hand, deliver exactly the result we expect, with sentence A being the most similar to the template, almost identical, sharing 18 out of 20 possible 2-grams. Note there is a variation of Levenshtein distance called Damerau–Levenshtein, but it only allows for transposition of two adjacent characters.
  14. clean_ontology_terms.pl is being retired in favour of similarity_match.pl. Emma (emma@ebi.ac.uk) refactored all the code and repackaged it for easier integration and reuse as a dedicated set of modules, EBI::FGPT::FuzzyRecogniser (http://search.cpan.org/dist/EBI-FGPT-FuzzyRecogniser/), available on CPAN.
  15. Blowing my own trumpet here. The OntoCAT article was featured in the top 10 most accessed articles at BMC Bioinformatics a few months ago. The website (http://www.ontocat.org) sees about 1,000 pageviews monthly.
  16. But it was Natalja and Misha who stole the show with the ontoCAT R package included in Bioconductor. Googling for ‘ontology R’ returns the wiki page for the package as the first hit, and the actual article as the fourth. This is no small feat considering the dedicated Gene Ontology R packages that otherwise dominate this space.
  17. An example of a directed acyclic graph representing all the relationships in the ontology for a particular EFO term, ‘EFO_0000815’ (heart). Edges are labelled according to the relationship. Organism part classes are represented as ellipses and disease classes are shown as rectangles. The ontoCAT package was used to compute the relationships, which were later processed in Cytoscape (Cline et al., 2007). Converting the whole ontology into what is effectively RDF triples is a computationally intensive task, and takes about 30 minutes when run on 200 cluster nodes and parallelised by multiprocessing. It is demonstrated in Example 16 in the online documentation (http://www.ontocat.org/browser/trunk/ontoCAT/src/uk/ac/ebi/ontocat/examples/Example16.java).
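
The contrast between Levenshtein distance (note 11) and n-grams (note 13) can be sketched in a few lines of Python. This is a minimal illustration only, not the code of similarity_match.pl or EBI::FGPT::FuzzyRecogniser. The function names, the '#' padding character, and the choice to count shared 2-grams as a multiset are all assumptions made for this example. Padding each string with one '#' at both ends is what gives a 19-character string 20 2-grams.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b (bottom-up dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def ngrams(text: str, n: int = 2, pad: str = "#") -> Counter:
    """Multiset of n-character fragments; the padding is an assumption."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Shared n-grams divided by the n-gram count of the longer string."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    longest = max(sum(grams_a.values()), sum(grams_b.values()))
    return shared / longest if longest else 1.0


TEMPLATE = "The quick brown fox"
SENTENCE_A = "brown quick The fox"  # same words, different order
SENTENCE_B = "The quiet swine flu"  # different meaning

for label, candidate in (("A", SENTENCE_A), ("B", SENTENCE_B)):
    print(label,
          "edit distance:", levenshtein(TEMPLATE, candidate),
          "2-gram similarity:", round(ngram_similarity(TEMPLATE, candidate), 2))
```

Under these assumptions the 2-gram scores come out at 0.90 for A and 0.40 for B, in line with the figures on slide 13, while the edit distance ranks B closer to the template than A.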