Ontology Learning From Text Using Text2Onto

Ontology Learning From Text?
Robert Stevens
BioHealth Informatics Group
School of Computer Science
University of Manchester
Robert.stevens@manchester.ac.uk

Introduction
• Can we use ontology learning to build
ontologies?
• Not text-mining research, but ontology
research
• What is ontology learning from text?
• The questions we posed
• The experiment we performed
• The results we obtained
• The conclusions we made

Ontology learning
• Text2Onto: http://ontoware.org/projects/text2onto/
• “The erythrocytes are the blood cells that carry
oxygen to other cells in the body”
• “Lymphocytes, leukocytes, monocytes, phagocytes
and granulocytes are all kinds of white blood cell”
• “These experiments show that the individual
hemopoietic stem cell is a multipotent cell and can
give rise to the complete range of blood cell types,
both myeloid and lymphoid, as well as new stem cells
like itself.”

Ontology Learning
[Figure: example ontology. Classes shown: Blood Cell, Erythrocyte, White Blood Cell, Monocyte, Leukocyte, Lymphocyte, Phagocyte, Granulocyte, Multipotent Stem Cell, Hemopoietic Stem Cell. Relation shown: “arise from”.]

Text to Ontology “Workflow”
• Corpus
• Tokenising / sentence splitting
• Part-Of-Speech (POS) tagging
• Lemmatizing / stemming
• JAPE transducer annotates corpus
• Text2Onto algorithms for extracting modeling primitives
• Text2Onto meta-ontology
• Promotion to OWL ontology

Extracting Patterns from Text
Sentence:
“CFU-S is a blood stem cell”
Part of Speech (POS) Tagging:
CFU-S[NNP] is[VBZ] a[DT] blood[NN] stem[NN] cell[NN]
Key: NN=noun; DT=determiner; NNP=proper noun; VBZ=verb, third person singular present.
Pseudo JAPE rule:
Any series of nouns (A) followed by the string “ is a ”
followed by a series of nouns (B)
Ontological assertions:
A and B are concepts, A is a subclass of B
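The pseudo JAPE rule can be sketched in a few lines of Python. This is only an illustration of the pattern, not JAPE syntax and not Text2Onto code: the function names `noun_run` and `extract_is_a` are hypothetical, and the input is assumed to be already tokenised and POS-tagged with Penn Treebank style tags.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Tags treated as "noun" for the purposes of the rule (an assumption).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_run(tokens: List[Token], start: int, step: int) -> List[str]:
    """Collect consecutive noun words from `start`, moving by `step` (+1 or -1)."""
    words = []
    i = start
    while 0 <= i < len(tokens) and tokens[i][1] in NOUN_TAGS:
        words.append(tokens[i][0])
        i += step
    # Words gathered right-to-left are reversed back into reading order.
    return words if step == 1 else words[::-1]


def extract_is_a(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (A, B) pairs: noun series A, the words 'is a', noun series B."""
    pairs = []
    for i in range(1, len(tokens) - 2):
        if tokens[i][0].lower() == "is" and tokens[i + 1][0].lower() == "a":
            left = noun_run(tokens, i - 1, -1)
            right = noun_run(tokens, i + 2, +1)
            if left and right:
                pairs.append((" ".join(left), " ".join(right)))
    return pairs


# The tag on "is" is ignored by the rule; only the word itself is matched.
tagged = [("CFU-S", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("blood", "NN"), ("stem", "NN"), ("cell", "NN")]
print(extract_is_a(tagged))  # [('CFU-S', 'blood stem cell')]
```

Each returned pair (A, B) corresponds to the assertions on the slide: A and B are concepts, and A is a subclass of B.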

Some Text2Onto Instances
• Instance: Astrocyte_c
– typeOf: Concept that
– Fact: confidence VALUE 1.0
• Instance: AstrocyteNerveCell
– typeOf: Subclass that
– Fact: domain VALUE NerveCell and
– Fact: range VALUE Astrocyte and
– Fact: confidence VALUE 1.0

The Questions We Asked
• Can we press the button and get a
good ontology?
• If not, can we get something useful?
• Can we do it without having to write too
many rules?
• Does the end-point act as a donor or
recipient ontology?

Strategy
• Collect corpus
• Manually mark up text for cells: definitive list
of terms
• Process corpus through T2O
• Analyse output of T2O for recall and precision
of terms and hierarchy
• Iterate the previous two steps with variants of
the rules
• Evaluation against CTO gold standard

The Experimental Conditions
• Default T2O
• T2O plus cell-specific JAPE rules and all
algorithms
• Only cell-specific JAPE rules,
/EntropyExtraction Algorithm and some
“hierarchy spotting” based on term
composition
• Same as the third condition, but with
VerticalRelationsConceptClassification to
include our simple JAPE rules
• Same as the fourth condition, but with WordConceptClassification

Rules for Extracting Cell Types
• Words ending in ‘cyte’, ‘blast’, ‘cell’, ‘glia’, ‘glium’, ‘cell type’, ‘cell line’
and ‘cell lineage’ (together with their plurals)
• Zero or more adjectives followed by zero or more nouns or proper
nouns followed by a ‘cell word’ (together with its plural), e.g. ‘Renshaw cell’,
‘Muller cell’, ‘immature blood cell’, etc.
• Any stem cell term is a Stem Cell.
• Any term ending with ‘progenitor cell’ is a Progenitor Cell.
• Any term ending with ‘precursor cell’ is a Precursor Cell.
• Any term ending in ‘blast’ is a Blast Cell.
• Any term ending with ‘cyte’ or ‘cell’ is a Differentiated Cell.
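The suffix rules above can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the JAPE rules used in the experiment: the names `singular`, `is_cell_term` and `parent_class` are hypothetical, plural handling is a crude "drop one trailing s", the order in which rules are tried (most specific first) is assumed, and the adjective/noun rule is not covered because it needs POS tags.

```python
from typing import Optional

# Endings that mark a candidate cell term (first rule on the slide).
CELL_SUFFIXES = ("cell lineage", "cell type", "cell line",
                 "cyte", "blast", "cell", "glia", "glium")

# (suffix, parent class), tried in order so specific rules win over generic ones.
PARENT_RULES = [
    ("stem cell", "Stem Cell"),
    ("progenitor cell", "Progenitor Cell"),
    ("precursor cell", "Precursor Cell"),
    ("blast", "Blast Cell"),
    ("cyte", "Differentiated Cell"),
    ("cell", "Differentiated Cell"),
]


def singular(term: str) -> str:
    """Lower-case the term and crudely strip one trailing plural 's'."""
    t = term.strip().lower()
    return t[:-1] if t.endswith("s") and not t.endswith("ss") else t


def is_cell_term(term: str) -> bool:
    """True if the term ends in one of the cell-word suffixes."""
    return singular(term).endswith(CELL_SUFFIXES)


def parent_class(term: str) -> Optional[str]:
    """Return the parent class suggested by the first matching suffix rule."""
    t = singular(term)
    for suffix, parent in PARENT_RULES:
        if t.endswith(suffix):
            # Do not make a class its own parent.
            return None if t == parent.lower() else parent
    return None


for term in ["erythrocytes", "myeloblast", "Hemopoietic stem cell",
             "neural progenitor cells", "astroglia", "contains cell"]:
    print(term, "|", is_cell_term(term), "|", parent_class(term))
```

A purely suffix-based check accepts any phrase ending in "cell", including "contains cell", so false positives of that kind are to be expected from rules this simple.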

Evaluation Strategy
• Extraction performance
• Ontology evaluation
• Domain coverage
• Expert evaluation

Term Recognition
• 1,277 terms in our definitive list
• 16,384 terms from whole corpus; 625
relevant
• Increase to 17,851 terms; 916 relevant
• All 118 CTO terms in corpus recalled
• Corpus has anatomical bias
• Simple rules exploit regularity of language
• Many false positives from adjective noun rule
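As a rough worked example, the counts above can be turned into precision and recall figures. This sketch rests on an assumption the slides do not state: that "relevant" means an extracted term that also appears in the 1,277-term definitive list. The run labels "first run" and "later run" are placeholders for the two sets of counts.

```python
def precision(relevant_extracted: int, total_extracted: int) -> float:
    """Fraction of extracted terms that are relevant."""
    return relevant_extracted / total_extracted


def recall(relevant_extracted: int, total_relevant: int) -> float:
    """Fraction of the definitive list that was extracted (assumed meaning)."""
    return relevant_extracted / total_relevant


DEFINITIVE_TERMS = 1277  # size of the manually built definitive list

# (terms extracted, relevant terms) for the two sets of counts on the slide.
runs = {"first run": (16384, 625), "later run": (17851, 916)}

for name, (extracted, relevant) in runs.items():
    p = precision(relevant, extracted)
    r = recall(relevant, DEFINITIVE_TERMS)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

Under that assumption, precision stays low in both cases while recall against the definitive list rises, which fits the observation that the simple rules find many terms but also many false positives.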

Cell Terms
• Morphology: Stellate cell; columnar cell
• Ploidy: Tetraploid cell; multiploid cell
• Maturity
• Potentiality: Totipotent stem cell; multipotent cell
• Species origin: Animal cell; human cell
• Anatomical location
• Developmental stage: Mitotic cell; S-phase cell
• Lineage: Mesoderm cell

Common errors
Manually extracted from corpus | Automatically extracted from corpus | Comments
(none) | +t - cell | Symbols not handled very well
(none) | contains cell | False-positive cell type
(none) | Foam cell | New cell type extracted
leukocyte | leucocyte | Spelling errors in corpus
naïve cell | nave cell | Character encoding problem
Spermatogonia | (none) | No rule to extract

Final learnt ontology
Still not perfect!

Discussion
• Exploiting poor performance to focus learning
• Exploiting regularity of language
• Never really going to find the domain-general
layer of CTO
• Terms highly compositional and conflate axes
• Ask the question “is it useful?” not “is it
good?”
• Is CTO a good standard?
• The extracted hierarchy was not bad from a
cell biology and ontological point of view

Nascent Methodology
• Form a corpus that includes, but is not limited
to, the scope of the target ontology
• Extract terms from corpus
• Filter and massage list of terms to find those
of ontological interest
• Use ontology learning to see what happens
• Inspect and augment rules to recognise and
incorporate into hierarchy
• Iterate
• Use as donor ontology to transfer
useful bits to recipient ontology

Conclusions
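• Can we press the button and get a good ontology? No
• If not, can we get something useful? Yes
• Can we do it without having to write too many rules? Yes
• Donor or recipient ontology? Donor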

Acknowledgements
• Simon Jupp has done the work
• Jaclyn Bibby MSc Project prototype
• Johanna Volker for help with Text2Onto
• David Shotton for knowledge about cell
biology
