Pipeline for automated structure-based
classification in the ChEBI ontology
Janna Hastings
Coordinator,
Cheminformatics and Metabolism
www.ebi.ac.uk/chebi
ACS Symposium on Chemical Ontologies,
Taxonomies and Schemas. Dallas, 16 March 2014
Chemical Entities of Biological Interest
• Freely available online, available for download in full
• Low molecular weight, i.e. no proteins
• Definitions, relationships, hierarchy
• E.g. metabolites, drugs, pesticides
• 38,215 entries in the last release
What does ChEBI provide?
• Chemical structures and visualisations
• Names and synonyms: caffeine, 1,3,7-trimethylxanthine, methyltheobromine
• Chemical data: Formula: C8H10N4O2; Charge: 0; Mass: 194.19
• Ontology classifications: metabolite, CNS stimulant, trimethylxanthines
• Links to more information: MSDchem: CFF; KEGG DRUG: D00528; PubMed citations
• Chemical informatics:
InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
SMILES CN1C(=O)N(C)c2ncn(C)c2C1=O
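As a rough check on the chemical data shown for caffeine, the mass can be recomputed from the formula. This is a minimal sketch, not how ChEBI computes its values: the function name `average_mass` and the rounded atomic weights are assumptions made for illustration, and the parser only handles simple formulae without brackets or charges.

```python
import re

# Standard atomic weights, rounded (assumption: only the elements needed here).
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def average_mass(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total

caffeine_mass = round(average_mass("C8H10N4O2"), 2)
print(caffeine_mass)  # 194.19
```

The result agrees with the mass of 194.19 listed on the slide.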
Example ChEBI entry page (screenshots shown over three slides)
Structure-based classification in ChEBI
Challenges with manual classification
• May be incomplete
• May be inconsistent
• Difficult to maintain (even with extensive use of
computationally expensive automatic validations)
• Blocks automatic loading of otherwise high-quality
externally annotated chemical data into ChEBI
(as no classification available)
SOCO (SMARTS, OWL)
Leonid Chepelev, Michel Dumontier, collaborators
• Given a training set of classified molecules, examine
structures for consensus features across all (using
fragmentation and feature detection)
• Capture features hierarchically
• Use OWL to classify
Chepelev et al. BMC Bioinformatics 2012 13:3 doi:10.1186/1471-2105-13-3
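The idea of consensus features can be illustrated with a toy example. This is only a sketch of the general principle (features shared by every member of a training class), not the SOCO implementation; the molecule names, feature labels and the function `consensus_features` are hypothetical.

```python
# Hypothetical training set for one class: each molecule is mapped to the
# structural features detected in it (feature labels are made up).
training_set = {
    "acetic acid": {"carboxy group", "methyl group"},
    "propionic acid": {"carboxy group", "methyl group", "methylene group"},
    "benzoic acid": {"carboxy group", "benzene ring"},
}

def consensus_features(features_by_molecule):
    """Return the features present in every molecule of the class."""
    feature_sets = list(features_by_molecule.values())
    return set.intersection(*feature_sets) if feature_sets else set()

consensus = consensus_features(training_set)
print(consensus)  # {'carboxy group'}
```

Here the only feature common to all three training molecules is the carboxy group, which would then serve as the defining feature of the class.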
Limitations of SOCO
• No support for negation
• Only “min” (at least) counting supported, not max or
exact. Thus, dicarboxylic acid is_a monocarboxylic acid
(Every two-legged human is also a one-legged human in the sense
that they have at least one leg…)
• SMARTS is powerful – but not very human-readable.
ChEBI is for human biologist and chemist consumption.
E.g. SMARTS for the class of aliphatic amines:
[$([NH2][CX4]),$([NH]([CX4])[CX4]),$([NX3]([CX4])([CX4])[CX4])]
Can we do better at making definitions accessible?
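The counting limitation can be made concrete with a small sketch. The helper names `matches_min` and `matches_exact` are hypothetical; they only contrast "at least n" with "exactly n" semantics for a count of carboxy groups.

```python
def matches_min(count: int, n: int) -> bool:
    """'At least n' semantics, the only counting available in SOCO."""
    return count >= n

def matches_exact(count: int, n: int) -> bool:
    """'Exactly n' semantics."""
    return count == n

carboxy_groups = 2  # a dicarboxylic acid

# With min counting, the dicarboxylic acid also satisfies the
# monocarboxylic acid definition (at least one carboxy group).
mono_min = matches_min(carboxy_groups, 1)
# With exact counting it does not.
mono_exact = matches_exact(carboxy_groups, 1)
di_exact = matches_exact(carboxy_groups, 2)

print(mono_min, mono_exact, di_exact)  # True False True
```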
A new pipeline for automated structure-based ontology classification in ChEBI
• Inputs: Definitions (OWL), ChEBI structures, novel structure
• OWL Parser => logical cheminformatics definitions
• Matching => candidate classes
• Ranking => best classes: save is_a relations
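The matching and ranking stages can be sketched as follows. This is an illustration under stated assumptions, not the ChEBI pipeline itself: the class definitions, the `parents` hierarchy and the ranking rule (keep only the most specific candidates) are all hypothetical, since the slides do not say how ranking is done.

```python
# Hypothetical class definitions: class name -> predicate over a simple
# dict describing a structure.
definitions = {
    "molecular entity": lambda m: True,
    "cyclic compound": lambda m: m["cycles"] >= 1,
    "monocyclic compound": lambda m: m["cycles"] == 1,
    "organic ion": lambda m: m["organic"] and m["charge"] != 0,
}

# Hypothetical is_a hierarchy (child -> parent).
parents = {
    "cyclic compound": "molecular entity",
    "monocyclic compound": "cyclic compound",
    "organic ion": "molecular entity",
}

def ancestors(cls):
    """Collect all ancestors of a class by walking child -> parent links."""
    found = set()
    while cls in parents:
        cls = parents[cls]
        found.add(cls)
    return found

def match(structure):
    """Matching: every class whose definition the structure satisfies."""
    return [name for name, test in definitions.items() if test(structure)]

def rank(candidates):
    """Ranking (assumed rule): drop any candidate that is an ancestor of another."""
    covered = set()
    for c in candidates:
        covered |= ancestors(c)
    return sorted(c for c in candidates if c not in covered)

novel = {"cycles": 1, "charge": 0, "organic": True}
candidates = match(novel)
best = rank(candidates)
print(best)  # ['monocyclic compound']
```

The best classes returned by `rank` are the ones for which is_a relations would be saved.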
Human-readable definitions, mapped to structures in ChEBI knowledgebase
thiadiazoles: molecular_entity and has_part some ( 1,2,3-thiadiazole or 1,2,4-thiadiazole or 1,2,5-thiadiazole or 1,3,4-thiadiazole )
diterpenoid: organic_molecular_entity and has_part exactly 2 terpenoid
organic ion: organic_molecular_entity and ( has_charge some int[>0] or has_charge some int[<0] )
monocyclic compound: molecular_entity and has_cycles value "1"^^int
Supported constructs:
• Logical operators
• Counts (min, max and exact)
• Properties
• Parts
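To show how such definitions translate into checks on a structure, the four examples can be written as plain predicates. This is a minimal sketch: the `Molecule` record and its fields are hypothetical stand-ins for properties that a cheminformatics toolkit would compute from a real structure.

```python
from dataclasses import dataclass, field

@dataclass
class Molecule:
    """Hypothetical summary of a structure (not a real toolkit object)."""
    organic: bool
    charge: int = 0
    cycles: int = 0
    parts: list = field(default_factory=list)

THIADIAZOLES = {
    "1,2,3-thiadiazole", "1,2,4-thiadiazole",
    "1,2,5-thiadiazole", "1,3,4-thiadiazole",
}

def is_thiadiazole(m):
    # has_part some ( ... or ... ): at least one of the listed parts
    return any(p in THIADIAZOLES for p in m.parts)

def is_diterpenoid(m):
    # organic_molecular_entity and has_part exactly 2 terpenoid
    return m.organic and m.parts.count("terpenoid") == 2

def is_organic_ion(m):
    # has_charge some int[>0] or has_charge some int[<0]
    return m.organic and (m.charge > 0 or m.charge < 0)

def is_monocyclic(m):
    # has_cycles value "1"^^int
    return m.cycles == 1

CHECKS = [
    ("thiadiazoles", is_thiadiazole),
    ("diterpenoid", is_diterpenoid),
    ("organic ion", is_organic_ion),
    ("monocyclic compound", is_monocyclic),
]

example = Molecule(organic=True, charge=-1, cycles=1, parts=["1,3,4-thiadiazole"])
labels = [name for name, test in CHECKS if test(example)]
print(labels)  # ['thiadiazoles', 'organic ion', 'monocyclic compound']
```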
Planned integration into ChEBI tools
• ChEBI internal data loader and bulk submissions
• ChEBI online submission tool
Pre-population of matched classes
Acknowledgements – Thanks!
ChEBI team:
Christoph Steinbeck
Gareth Owen
Adriano Dekker
Namrata Kale
Steve Turner
Venkatesh Muthukrishnan
Collaborators:
Colin Batchelor, RSC
Lian Duan, ETH
Leonid Chepelev, Ottawa
Michel Dumontier, Stanford
Despoina Magka, Oxford
Ilinca Tudose and John May, EBI
Funding:
BBSRC “Continued
development of ChEBI towards
better usability for the systems
biology and metabolic
modelling communities”
BB/K019783/1
Questions?
Thank you for listening!
chebi-help@ebi.ac.uk


Editor's Notes

  1. ChEBI is a database and ontology of chemical entities of biological interest. As of October 2013, it contains more than 35,000 entries, organised into a structure-based and role-based classification hierarchy. Each entry is extensively annotated with a name, definition and synonyms, other metadata such as cross-references, and chemical structure information where appropriate. In addition to the classification hierarchy, the ontology also contains diverse chemical and ontological relationships. While ChEBI is primarily manually maintained, recent developments have focused on improvements in curation through partial automation of common tasks. We will describe a pipeline we have developed for structure-based classification of chemicals into the ChEBI structural classification. The pipeline connects class-level structural knowledge encoded in Web Ontology Language (OWL) axioms as an extension to the ontology, and structural information specified in standard MOLfiles. We make use of the Chemistry Development Kit, the OWL API and the OWLTools library. Harnessing the pipeline, we are able to suggest the best structural classes for the classification of novel structures within the ChEBI ontology.