k-BOOM
A Bayesian approach to ontology structure inference,
with applications in disease ontology construction
Chris Mungall
Lawrence Berkeley Laboratory
PhenoDay 2016
@monarchinit
@chrismungall
Building a cohesive, complete disease
ontology
Objective
• Combine existing disease
classifications and lists into
a unified, cohesive
framework
• Best of all worlds
• Integrate data from multiple
resources
Challenges
• Current resources
developed independently,
different perspectives
• Mappings are imprecise
OMIM Orphanet DO MESH NCIT Decipher ICD SNOMED
Combined, coherent view
Disease classifications and why
mappings are not enough
• Given N disease lists
– Where each provides cross-references
(xrefs) to up to N-1 others
– Up to (N^2)-N sets of mappings
• Even more with 3rd party mappings
– These are frequently
• Inconsistent (directly or indirectly)
• Different meanings and levels of specificity
• Incomplete
• Stale
• Difficult to computationally verify
• Fundamental issue
– Xrefs lack semantics
– Explicit semantics would enable
computational checks
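As a quick check of the (N^2)-N figure above, the sketch below counts the directed pairs among the eight resources named on the previous slide. Treating each ordered pair as one potential mapping set is an assumption made only for illustration.

```python
from itertools import permutations

# Resource names taken from the previous slide; each ordered pair (A, B)
# stands for "A provides xrefs to B".
resources = ["OMIM", "Orphanet", "DO", "MESH", "NCIT", "Decipher", "ICD", "SNOMED"]

directed_pairs = list(permutations(resources, 2))
n = len(resources)

print(len(directed_pairs), n**2 - n)  # both counts agree
```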
[Figure: Ont1, Ont2, Ont3, Ont4, Ont5, Ont6]
[Figure: Hemolytic anemia in 4 disease resources plus mappings.
Nodes: DOID (blue), OMIM (brown), MESH (grey), ORDO/Orphanet (yellow).
Edges: SubClassOf (solid line), Xref (dashed grey line).]
Objective: Coherent OWL Ontology
Merging (OOM)
• Criteria for OOM
– Merged
• Combines multiple lists and classifications (terminologies
and lists are treated as ‘degenerate’ ontologies), presented as a
single ontology
• Equivalent classes merged
– Logically Connected
• OWL/Description Logic constructs
– e.g. SubClassOf, EquivalentClass, SomeValuesFrom
• Not xrefs
– Coherent
• Logically coherent: no unsatisfiable classes
• Biologically coherent: makes biological and clinical sense
Our previous approach, applied to
phenotypes: L-DOOM
Logical Definition based OWL Ontology Merging
Mungall, C. J., Gkoutos, G., Smith, C., Haendel, M., Lewis, S., & Ashburner, M. (2010). Integrating phenotype ontologies across multiple species. Genome Biology, 11(1), R2.
doi:10.1186/gb-2010-11-1-r2
Köhler, S., Doelken, S. C., Ruef, B. J., Bauer, S., Washington, N., Westerfield, M., … Mungall, C. J. (2013). Construction and accessibility of a cross-species phenotype ontology
along with gene annotations for biomedical research. F1000Research, 1–12. doi:10.3410/f1000research.2-30.v1
Application to diseases?
• Works well for compositional classes (e.g. many cancer terms)
• Less well for genetic diseases, complex syndromes
1. Assign Logical Definitions
(OWL equivalence axioms) to
classes in each ontology
• Can be assigned manually or
semi-automatically (Obol)
HP:0002180 Neurodegeneration
Equiv: degenerate AND inheres-in SOME neuron (CL:0000540)
MP:0000876 Purkinje cell degeneration
Equiv: degenerate AND inheres-in SOME Purkinje cell (CL:0000121)
2. Using reasoning to infer logical
axioms (SubClassOf)
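The two logical definitions above can be used to illustrate how a reasoner arrives at a SubClassOf axiom. The sketch below is a toy structural check, not an OWL reasoner: it assumes a single asserted link (Purkinje cell is a neuron) and a hypothetical helper `inferred_subclass_of` that compares the quality and the bearer of each definition.

```python
# Toy subsumption check for "quality AND inheres-in SOME bearer" definitions.
# Assumption: is_a holds asserted SubClassOf links (here only the CL link
# "Purkinje cell is a neuron"); a real pipeline would use an OWL reasoner.
is_a = {"CL:0000121": {"CL:0000540"}}  # Purkinje cell SubClassOf neuron

# Logical definitions: class -> (quality, bearer of the inheres-in restriction)
definitions = {
    "HP:0002180": ("degenerate", "CL:0000540"),  # Neurodegeneration
    "MP:0000876": ("degenerate", "CL:0000121"),  # Purkinje cell degeneration
}

def ancestors(cls):
    """Return cls plus everything reachable through asserted is_a links."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def inferred_subclass_of(sub, sup):
    """sub is subsumed by sup if both its quality and its bearer are subsumed."""
    q_sub, b_sub = definitions[sub]
    q_sup, b_sup = definitions[sup]
    return q_sup in ancestors(q_sub) and b_sup in ancestors(b_sub)

print(inferred_subclass_of("MP:0000876", "HP:0002180"))
print(inferred_subclass_of("HP:0002180", "MP:0000876"))
```

Under these assumptions the MP class is inferred to be a subclass of the HP class, but not the other way round.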
Probabilistic Ontology OP = <A,H>
BOOM (Bayes OWL Ontology Merging):
Finds the set of hypothetical axioms that maximises P(OP)
[Diagram components: Ontology 1, Ontology 2, ..., Ontology n; mapping tool;
mapping curation; Inter-Ontology Mappings; Axiom Weight Estimator;
Weight Curation; Hypothetical Logical Axioms plus Weights (H); Elk Reasoner;
Merged Coherent OWL Ontology; Merge equivalent classes; Next iteration]
Generating hypothetical logical axioms
[Diagram: Inter-Ontology Mappings; Axiom Weight Estimator;
Hypothetical Logical Axioms plus Weights (H)]
E.g. OMIM:123 xref DOID:987
Pr(OMIM:123 ≡ DOID:987) = 0.3
Pr(OMIM:123 ⊂ DOID:987) = 0.4
Pr(OMIM:123 ⊃ DOID:987) = 0.1
Domain rules (lexical, structural, …)
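One way to picture the expansion of a single xref into weighted candidate axioms is sketched below. The class `HypotheticalAxiom` and the function `axioms_for_xref` are hypothetical names, and assigning the leftover probability mass to a fourth "none" outcome is an assumption; the slide only lists the three probabilities shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HypotheticalAxiom:
    left: str
    relation: str  # "equivalent", "subclass", "superclass" or "none"
    right: str
    prob: float

def axioms_for_xref(left, right, p_equiv, p_sub, p_super):
    """Expand one xref into mutually exclusive candidate axioms.

    Assumption: whatever probability mass is left over goes to a fourth
    outcome, "none" (no logical relationship between the two classes).
    """
    p_none = 1.0 - (p_equiv + p_sub + p_super)
    if p_none < -1e-9:
        raise ValueError("probabilities sum to more than 1")
    return [
        HypotheticalAxiom(left, "equivalent", right, p_equiv),
        HypotheticalAxiom(left, "subclass", right, p_sub),
        HypotheticalAxiom(left, "superclass", right, p_super),
        HypotheticalAxiom(left, "none", right, max(p_none, 0.0)),
    ]

# Values from the example on the slide.
candidates = axioms_for_xref("OMIM:123", "DOID:987", 0.3, 0.4, 0.1)
for ax in candidates:
    print(f"Pr({ax.left} {ax.relation} {ax.right}) = {ax.prob:.1f}")
```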
k-BOOM Algorithm for finding most
likely merged ontology
Find values for H that maximise P.
h_i: boolean representing truth value of hypothetical axiom H_i
Problem: 2^N candidate ontologies for N hypothetical axioms
1. Factorize calculation by dividing combined
axioms into k modules (k-BOOM)
Algorithm:
i. Assert all hypothetical axioms to be true
ii. Make module from equivalence clique
2. Use greedy algorithm; start with
most likely hypothetical axioms in O_k
3. Test each configuration using OWL
Reasoner (Elk) for satisfiability
(unsat => Pr=0), calc posterior probability
4. Repeat until number of tests
exceeds threshold
5. Return most likely configuration for O_k
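Steps 2 to 5 can be sketched for a single module as a best-first search over configurations. This is a minimal illustration under stated assumptions, not the k-BOOM implementation: the `SLOTS` data are made up, `is_coherent` is a toy stand-in for the Elk check (it rejects any configuration that makes two distinct classes with the same ID prefix equivalent), and the posterior is normalised only over the coherent configurations that were tested.

```python
import heapq
from math import prod

# Each slot: (left, right, {relation: prior}). "sub" means left SubClassOf
# right, "sup" the reverse, "eq" equivalence. All values are illustrative.
SLOTS = [
    ("OMIM:1", "DOID:1", {"eq": 0.7, "sub": 0.2, "sup": 0.05, "none": 0.05}),
    ("OMIM:1", "DOID:2", {"eq": 0.6, "sub": 0.3, "sup": 0.05, "none": 0.05}),
]

def edges_for(config):
    """Translate a configuration (one relation per slot) into is_a edges."""
    edges = []
    for (left, right, _), rel in zip(SLOTS, config):
        if rel in ("sub", "eq"):
            edges.append((left, right))
        if rel in ("sup", "eq"):
            edges.append((right, left))
    return edges

def reachable(start, edges):
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def is_coherent(config):
    """Toy stand-in for the reasoner: no two distinct classes from the same
    source (same ID prefix) may be inferred equivalent."""
    edges = edges_for(config)
    for a in {n for e in edges for n in e}:
        for b in reachable(a, edges):
            same_source = a.split(":")[0] == b.split(":")[0]
            if a != b and same_source and a in reachable(b, edges):
                return False
    return True

def most_likely(max_tests=50):
    """Try configurations in order of decreasing prior, up to max_tests."""
    options = [sorted(p.items(), key=lambda kv: -kv[1]) for _, _, p in SLOTS]
    def prior(idx):
        return prod(options[i][j][1] for i, j in enumerate(idx))
    start = (0,) * len(SLOTS)
    heap, seen, coherent, tests = [(-prior(start), start)], {start}, [], 0
    while heap and tests < max_tests:
        neg_p, idx = heapq.heappop(heap)
        tests += 1
        config = tuple(options[i][j][0] for i, j in enumerate(idx))
        if is_coherent(config):  # incoherent configurations get Pr = 0
            coherent.append((-neg_p, config))
        for i in range(len(idx)):  # successors: demote one slot's choice
            if idx[i] + 1 < len(options[i]):
                nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-prior(nxt), nxt))
    if not coherent:
        return None, 0.0
    best_p, best = max(coherent)
    return best, best_p / sum(p for p, _ in coherent)

best, posterior = most_likely()
print(best, round(posterior, 3))
```

In this toy module the a priori most likely configuration (both xrefs read as equivalence) would force DOID:1 and DOID:2 to be equivalent, so it is rejected and the search settles on the next most probable coherent configuration.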
Probability guided curator workflow:
A little knowledge goes a long way
• Run cycle
• Examine results for modules
with:
– low posterior probability
– low confidence (top ranked
solution has similar P to next
ranked)
– Pr(H_i = true) << threshold
• Apply biological/clinical
knowledge
• Override auto-generated
hypothetical axiom weights with
curated ones
– Feedback issues to source
ontologies
• Repeat
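The triage step in this workflow can be sketched as a simple ranking of modules. The module names, posterior values, and the two thresholds below are all made up for illustration; `needs_attention` is a hypothetical helper, not part of kboom.

```python
# Hypothetical per-module results: posterior probabilities of the ranked
# solutions, best first. Names and numbers are invented for illustration.
modules = {
    "module_a": [0.92, 0.05],
    "module_b": [0.41, 0.39],  # top two solutions nearly tied: low confidence
    "module_c": [0.08, 0.01],  # low posterior even for the best solution
}

def needs_attention(posteriors, min_posterior=0.2, min_ratio=2.0):
    """Flag a module if its best solution is improbable or barely beats the runner-up."""
    best = posteriors[0]
    runner_up = posteriors[1] if len(posteriors) > 1 else 0.0
    low_posterior = best < min_posterior
    low_confidence = runner_up > 0 and best / runner_up < min_ratio
    return low_posterior or low_confidence

# Review queue for the curator, least probable module first.
queue = sorted(
    (name for name, p in modules.items() if needs_attention(p)),
    key=lambda name: modules[name][0],
)
print(queue)
```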
[Diagram: dialog between MonDO curator and external ontology curator]
Application: merging diseases into
MonDO
https://github.com/monarch-initiative/monarch-disease-ontology
“Ontology”     Classes (before)   Classes (after merge)   SubClass axioms   Xrefs
Inputs:
DOID           6878               6012                    7082              36656
MESH (D)       11314              4152                    19036             -
OMIM (D)       7783               7783                    0                 31242
Orphanet (D)   8740               4683                    15182             20326
OMIA           4833               4833                    3120              355
DC             209                208                     310               316
Medic          0                  -                       8630              3435
Output:
MonDO          39757              27617                   44837             -
(- = no value given)
Held back: NCIT, SNOMED, ICD9, GARD
Example Module Resolution: ITM2B
amyloidosis
Example failed resolution – due to
ontology error
https://github.com/monarch-initiative/monarch-disease-ontology/issues/99
https://github.com/DiseaseOntology/HumanDiseaseOntology/issues/164
Example failed resolution – due to
MESH duplicates
https://github.com/monarch-initiative/monarch-disease-ontology/issues/81
Evaluating results of disease merger
• No gold standard for multiple ontology merger
– Partial evaluation using held-back Orphanet NTBT/E calls:
• 6977/7986 (87% agreement)
• Ad-hoc evaluation by curator
– Approach: use posterior probabilities to rank modules requiring
attention
– This is the killer-app feature
– Iteratively refine curated probabilities
• https://github.com/monarch-initiative/monarch-disease-ontology/issues/
• Results
– Manual inspection and use of MonDO
– Detection of errors in source ontologies
• E.g. duplicates in MESH
• Incorrect xrefs in DO, e.g.
– https://github.com/DiseaseOntology/HumanDiseaseOntology/issues
(issues #164, #163, #156, #154, #151, #150, #149, #140, #135)
Next Steps
• Integrate hypothetical axiom weight estimation into
Bayesian model
• Apply Markov Chain Monte Carlo (MCMC) methods for
estimating most likely graph
– E.g. Metropolis-Hastings
• Integrate other knowledge
– Logical Definitions (Phenotypes)
– Molecular knowledge
• Improve Evaluation
– Test k-BOOM on task where we have gold standard, e.g.
neuroanatomy/uberon
– Formal comparison with EFO, MedGen, …
Discussion
• Retrospective merging vs prospective
development
– Better to work together from outset (OBO model)
– However, current state of affairs is such that
expert knowledge is distributed across resources
– We want to preserve that rather than reinvent
– Coherent merging of molecular knowledge with
classical top-down knowledge will be required
moving forward
Implementation/Availability
• Software
– https://github.com/monarch-initiative/kboom
• Paper
– https://github.com/cmungall/kboom-paper
– http://biorxiv.org/content/early/2016/04/15/048843
• MonDO
– https://github.com/monarch-initiative/monarch-disease-ontology
– Both OWL ontology and axiom weight rules
Acknowledgments
k-BOOM
• Ian Holmes
• Sebastian Köhler
• Jim Balhoff
• Peter Robinson
• Melissa Haendel
Curation
• Nicole Vasilevsky (MonDO, DC)
• Sue Bello (DC)
• Elvira Mitraka (DO)
• Lynn Schriml (DO)
FUNDING: NIH Office of Director: 1R24OD011883; NIH-UDP:
HHSN268201300036C


Editor's Notes

  • #2 20 minutes. Sat July 9. 9.40am
  • #3 TODO: Make data integration
  • #5 https://github.com/monarch-initiative/monarch-disease-ontology/issues/90 Note the two subgraphs; little overlap in the upper areas
  • #6 Note Typical (top left) and Atypical are connected
  • #7 Note Typical (top left) and Atypical are connected
  • #10 We treat every resource as an ontology, even the degenerate case where it’s a flat list (e.g. OMIM). Pink = novel
  • #11 Heuristic/ad-hoc
  • #15 Fig. 2. Module resolution graph exported by kBOOM; Initial input is nodes plus solid arrows (SubClassOf axioms in ORDO). Dotted lines are supplied mappings (no logical interpretation). Figure shows inferred most likely configuration. equivalence=red, subclass=blue, with prior probabilities written as edge labels (thick lines more probable). Enclosing boxes denote equivalence cliques, which can be merged to a single class, yielding a grouping class with two children.
  • #18 TODO: Example of dupes in MESH Highlight flipping example