This document discusses using ontology learning from text to build ontologies. It describes using the Text2Onto tool to extract terms and hierarchies from a corpus of text about blood cells. The experiment showed that while not perfect, simple rules could extract most relevant terms from the corpus and organize them into a basic ontology structure reflecting cell types and relationships. Iteratively augmenting rules and focusing learning improved results. While not replicating a reference ontology exactly, the extracted ontology was deemed useful from a cell biology perspective.
Ontology Learning From Text Using Text2Onto
1. Ontology Learning From Text?
Robert Stevens
BioHealth Informatics Group
School of Computer Science
University of Manchester
Robert.stevens@manchester.ac.uk
2. Introduction
• Can we use ontology learning to build
ontologies?
• Not text-mining research, but ontology
research
• What is ontology learning from text?
• The questions we posed
• The experiment we performed
• The results we obtained
• The conclusions we made
3. Ontology learning
• Text2Onto: http://ontoware.org/projects/text2onto/
• “The erythrocytes are the blood cells that carry
oxygen to others cells in the body”
• “Lymphocytes, leukocytes, monocytes, phagocytes
and granulocytes are all kinds of white blood cell”
• “These experiments show that the individual
hemopoietic stem cell is a multipotent cell and can
give rise to the complete range of blood cell types,
both myeloid and lymphoid, as well as new stem cells
like itself.”
5. Text to Ontology “Workflow”
Corpus
Tokenising / Sentence splitting
Part-Of-Speech (POS) tagging
Lemmatizing / Stemming
JAPE transducer annotates corpus
Text2Onto algorithms for extracting modeling primitives
Text2Onto meta-ontology
Promotion to OWL ontology
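The first boxes of the workflow can be illustrated with a toy sketch. This is not how Text2Onto does it (its JAPE rules run inside the GATE framework); the function names below are hypothetical and the lemmatiser is deliberately naive, so treat it only as a picture of what each preprocessing step produces.

```python
import re

def split_sentences(text):
    # Naive splitter: break after ".", "!" or "?" when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    # A token is a word (hyphens allowed inside) or a single punctuation mark.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)

def lemmatise(token):
    # Toy lemmatiser (assumption): lower-case, then strip a plural "s"
    # from purely alphabetic words longer than three letters.
    word = token.lower()
    if word.isalpha() and len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    # Corpus -> sentences -> tokens -> lemmas.
    return [[lemmatise(t) for t in tokenise(s)] for s in split_sentences(text)]

if __name__ == "__main__":
    for sentence in preprocess("Lymphocytes are white blood cells. CFU-S is a blood stem cell."):
        print(sentence)
```

A real pipeline would use a proper tagger and lemmatiser; the toy version will mangle words such as "this".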
6. Extracting Patterns from Text
Sentence:
“CFU-S is a blood stem cell”
Part of Speech (POS) Tagging:
CFU-S[NNP] is[VBN] a[DT] blood[NN] stem[NN] cell[NN]
Key: NN=noun; DT=determiner; NNP=proper noun; VBN=verb past participle.
Pseudo JAPE rule:
Any series of nouns (A) followed by the string “ is a ” followed by a series of nouns (B)
Ontological assertions:
A and B are concepts, A is a subclass of B
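The pseudo rule can be mimicked in a few lines of Python. This is a sketch of the idea only, not JAPE syntax and not Text2Onto code; the helper names are made up, and the input is assumed to be in the word[TAG] form shown on the slide.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def parse_tagged(tagged):
    # "CFU-S[NNP]" -> ("CFU-S", "NNP")
    pairs = []
    for item in tagged.split():
        word, _, tag = item.rpartition("[")
        pairs.append((word, tag.rstrip("]")))
    return pairs

def noun_run(pairs, start, step):
    # Collect consecutive nouns starting at `start`, walking left (-1) or right (+1).
    words = []
    i = start
    while 0 <= i < len(pairs) and pairs[i][1] in NOUN_TAGS:
        words.append(pairs[i][0])
        i += step
    return words if step > 0 else words[::-1]

def is_a_rule(pairs):
    # Series of nouns (A), then "is a", then series of nouns (B)
    # yields the assertion: A is a subclass of B.
    assertions = []
    for i in range(1, len(pairs) - 2):
        if pairs[i][0].lower() == "is" and pairs[i + 1][0].lower() == "a":
            a = noun_run(pairs, i - 1, -1)
            b = noun_run(pairs, i + 2, +1)
            if a and b:
                assertions.append((" ".join(a), " ".join(b)))
    return assertions

if __name__ == "__main__":
    tagged = "CFU-S[NNP] is[VBN] a[DT] blood[NN] stem[NN] cell[NN]"
    for sub, sup in is_a_rule(parse_tagged(tagged)):
        print(f"{sub} subClassOf {sup}")
```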
8. Some Text2Onto Instances
• Instance: Astrocyte_c
– typeOf: Concept that
– Fact: confidence VALUE 1.0
• Instance: AstrocyteNerveCell
– typeOf: Subclass that
– Fact: domain VALUE NerveCell and
– Fact: range VALUE Astrocyte and
– Fact: confidence VALUE 1.0
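One way to picture these instances and their promotion to OWL is the sketch below. The class, the 0.5 threshold and the output syntax are all assumptions for illustration, as is the reading that the range (Astrocyte) is the subclass and the domain (NerveCell) the superclass.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubclassInstance:
    name: str
    domain: str        # "NerveCell" on the slide
    range_: str        # "Astrocyte" on the slide
    confidence: float

def promote(instances, threshold=0.5):
    # ASSUMPTION: range_ is read as the subclass and domain as the superclass.
    # The threshold is hypothetical; only confident instances are promoted.
    return [
        f"SubClassOf(:{inst.range_} :{inst.domain})"
        for inst in instances
        if inst.confidence >= threshold
    ]

if __name__ == "__main__":
    learnt = [SubclassInstance("AstrocyteNerveCell", "NerveCell", "Astrocyte", 1.0)]
    print(promote(learnt))
```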
9. The Questions We Asked
• Can we press the button and get a
good ontology?
• If not, can we get something useful?
• Can we do it without having to write too
many rules?
• Does the end-point act as a donor or
recipient ontology?
10. Strategy
• Collect corpus
• Manually markup text for cells: Definitive list
of terms
• Process corpus through T2O
• Analyse output of T2O for recall and precision
of terms and hierarchy
• Iterate the previous two steps with variants of
the rules
• Evaluation against CTO gold standard
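The recall and precision analysis in this strategy can be sketched with plain sets. This is a minimal illustration, assuming exact string matching between extracted and reference items; the example terms are illustrative and the function name is hypothetical. The same function works for the hierarchy if each item is a (child, parent) pair.

```python
def precision_recall(extracted, reference):
    # precision = correct extracted / all extracted
    # recall    = correct extracted / all reference items
    extracted, reference = set(extracted), set(reference)
    hits = extracted & reference
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall

if __name__ == "__main__":
    # Terms: manually marked-up list versus tool output (illustrative values).
    manual_terms = {"leukocyte", "astrocyte", "spermatogonia"}
    tool_terms = {"leukocyte", "astrocyte", "foam cell", "contains cell"}
    print(precision_recall(tool_terms, manual_terms))

    # Hierarchy: compare (child, parent) pairs in the same way.
    manual_pairs = {("lymphocyte", "white blood cell")}
    tool_pairs = {("lymphocyte", "white blood cell"), ("fibroblast", "protein")}
    print(precision_recall(tool_pairs, manual_pairs))
```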
11. The Experimental Conditions
• Default T2O
• T2O plus cell specific JAPE rules and all
algorithms
• Only cell specific JAPE rules,
/EntropyExtraction Algorithm and some
“hierarchy spotting” based on term
composition
• Same as 3, but with
VerticalRelationsConceptClassification to
include our simple JAPE rules
• Same as 4, but with WordConceptClassification
12. Rules for Extracting Cell
Types
• Words ending in ‘cyte’, ‘blast’, ‘cell’, ‘glia’, ‘glium’, ‘cell type’, ‘cell line’
and ‘cell lineage’ (together with their plurals)
• Zero or more adjectives followed by zero or more nouns or proper
nouns followed by a ‘cell word’ (together with plural) e.g. ‘renshaw cell’,
‘Muller cell’, ‘immature blood cell’, etc.
• Any stem cell term is a stem cell
• Any term ending with ‘progenitor cell’ is a Progenitor Cell.
• Any term ending with ‘precursor cell’ is a Precursor Cell.
• Any term ending in ‘blast’ is a Blast Cell.
• Any term ending with ‘cyte’ or ‘cell’ is a Differentiated Cell.
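The suffix rules above translate almost directly into code. The sketch below is plain Python, not the JAPE rules used in the experiment; the rule order (specific endings before the generic ‘cell’) and the naive plural handling are assumptions.

```python
CELL_WORDS = ("cyte", "blast", "cell", "glia", "glium",
              "cell type", "cell line", "cell lineage")

# Checked in order, so specific endings win over the generic "cell" (assumption).
PARENT_RULES = [
    ("stem cell", "Stem Cell"),
    ("progenitor cell", "Progenitor Cell"),
    ("precursor cell", "Precursor Cell"),
    ("blast", "Blast Cell"),
    ("cyte", "Differentiated Cell"),
    ("cell", "Differentiated Cell"),
]

def singular(term):
    # Naive plural handling: lower-case and drop one trailing "s".
    t = term.lower().strip()
    return t[:-1] if t.endswith("s") and not t.endswith("ss") else t

def is_cell_term(term):
    # True if the term ends in one of the 'cell words' (or its plural).
    return singular(term).endswith(CELL_WORDS)

def parent_class(term):
    # Return the parent class suggested by the term's ending, or None.
    t = singular(term)
    for suffix, parent in PARENT_RULES:
        if t.endswith(suffix):
            return parent
    return None

if __name__ == "__main__":
    for term in ["hemopoietic stem cells", "T-lymphocyte", "Muller cell", "fibroblast"]:
        print(term, "->", parent_class(term))
```

Note that ‘fibroblast’ comes out as a Blast Cell, which is exactly the kind of error a purely lexical rule produces.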
14. Term Recognition
• 1,277 terms in our definitive list
• 16,384 terms extracted from the whole corpus; 625
relevant
• Increased to 17,851 terms, of which 916 relevant
• All 118 CTO terms in corpus recalled
• Corpus has anatomical bias
• Simple rules exploit regularity of language
• Many false positives from adjective noun rule
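As a quick check on these counts, term precision and recall can be computed directly. The sketch assumes that recall is measured against the 1,277-term definitive list and that ‘relevant’ terms count as correct; the function name is hypothetical.

```python
def rates(extracted, relevant, reference_size):
    # precision = relevant / extracted; recall = relevant / reference list size
    return relevant / extracted, relevant / reference_size

if __name__ == "__main__":
    DEFINITIVE = 1277  # terms in the manually built definitive list
    for label, extracted, relevant in [
        ("first figures", 16384, 625),
        ("after increase", 17851, 916),
    ]:
        precision, recall = rates(extracted, relevant, DEFINITIVE)
        print(f"{label}: precision {precision:.1%}, recall {recall:.1%}")
```

Under these assumptions recall moves from about 49% to about 72%, while precision over the whole extracted list stays in the low single digits.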
16. Common errors
Manually extracted from corpus | Automatically extracted from corpus | Comments
- | +t - cell | Symbols not handled very well
- | contains cell | False-positive cell type
- | Foam cell | New cell type extracted
leukocyte | leucocyte | Spelling errors in corpus
naïve cell | nave cell | Character encoding problem
Spermatogonia | - | No rule to extract
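Two of these error types (spelling variants and character encoding) could be reduced by normalising terms before comparing the manual and automatic lists. The sketch below is one possible approach, not something the experiment is stated to have done; the variant table is hypothetical.

```python
import unicodedata

# Hypothetical table mapping spelling variants to one preferred form.
VARIANTS = {"leucocyte": "leukocyte"}

def normalise(term):
    # Decompose accented characters and drop the combining marks,
    # so "naïve" becomes "naive" rather than losing the letter.
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Lower-case and collapse whitespace, then map known variants.
    key = " ".join(stripped.lower().split())
    return VARIANTS.get(key, key)

if __name__ == "__main__":
    for term in ["na\u00efve cell", "Leucocyte", "Foam  cell"]:
        print(repr(term), "->", repr(normalise(term)))
```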
22. Discussion
• Exploiting poor performance to focus learning
• Exploiting regularity of language
• Never really going to find CTO domain
general layer
• Terms highly compositional and conflate axes
• Ask the questions “is it useful?” not “is it
good?”
• Is CTO a good standard?
• The extracted hierarchy was not bad from a
cell biology and ontological point of view
23. Nascent Methodology
• Form corpus that includes, but is not limited
to scope of target ontology
• Extract terms from corpus
• Filter and massage list of terms to find those
of ontological interest
• Use ontology learning to see what happens
• Inspect and augment rules to recognise and
incorporate into hierarchy
• Iterate
• Use as donor ontology to transfer
useful bits to recipient ontology
25. Acknowledgements
• Simon Jupp has done the work
• Jaclyn Bibby MSc Project prototype
• Johanna Volker for help with Text2Onto
• David Shotton for knowledge about cell
biology
Graph of increasing term precision and recall over the 5 experimental conditions. Recall 50% -> 72%. Precision 4% -> 49%.
OWLViz of the default ontology. Shows that fibroblast is a cell, but also incorrectly asserts that fibroblast is a protein. Also shows some other junk terms like ‘a_strong_candidate’ and ‘many_molecules’.
OWLViz of the final ontology. Shows t-lymphocyte is-a lymphocyte is-a white blood cell is-a blood cell is-a cell. Also shows that it is still not perfect: fibroblast is a blast cell, which is not actually correct.
Graph of OntoEval results showing gradual improvement of taxonomic recall and precision. Lexical Precision 40% -> 50%. In the final condition, where we placed it under CTO, it rose to 72%.
Same image as previous slide, but showing where we manually inserted our learnt ontology under CTO classes. This image is again only for cell_by_function