USING LEXICOGRAPHY TO CHARACTERISE RELATIONS BETWEEN SPECIES MENTIONS IN THE BIODIVERSITY LITERATURE
SANDRA YOUNG
UNIVERSITY OF BRIGHTON
WHAT DO I MEAN BY SPECIES MENTIONS?
BIODIVERSITY LITERATURE AND ONTOLOGIES
• Long-standing example of recorded heritage
• Knowledge representation today favours ontologies
• Is this a suitable approach for biodiversity?
BIOLOGICAL TAXONOMIES VS SCIENTIFIC NOMENCLATURE
SCIENTIFIC NOMENCLATURE AMBIGUITY
MELPOMENE
PURPOSE OF RESEARCH
• Adapt lexicography techniques to perform an empirical evaluation of nomenclature use in a specific corpus
• Compare the results with existing ontologies
LEXICOGRAPHY
• Corpus-based analysis
• Patterns of language use in context
• Used to define the different meanings of words
• Fluid and flexible: multiple meanings are the norm
SKETCH ENGINE – CORPUS QUERY TOOL
WORD SKETCHES
Extraction of physical elements from the earth
Extraction of data
METHODS – CORPUS
• Journal: Ecology of Freshwater Fish
• 3.5 million tokens
• Accessed through the University's subscription to Wiley
METHODS – ADAPT WORD SKETCHES
CORPUS ANNOTATION
• SCI 1 or 2: scientific names (as identified by GNRD, an automatic name extraction tool)
• COM: common species names (perch, trout, salmon, eel, chub, stickleback, goby, whitefish)
• GENCOLL: general collective terms for species (insect, species, specie, plant, fish, animal)
• GENPRT: life stages of species (nymph, parr, larvae, larva, egg)
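A minimal sketch of how the four annotation categories could be applied to a token stream, assuming a simple lexicon lookup. The word lists are taken from the slide; the function name `annotate` and the `scientific_names` argument (standing in for the output of a name extraction tool) are hypothetical, and the SCI 1 / SCI 2 distinction is not modelled.

```python
# Word lists come from the slide; everything else is illustrative.
COM = {"perch", "trout", "salmon", "eel", "chub", "stickleback", "goby", "whitefish"}
GENCOLL = {"insect", "species", "specie", "plant", "fish", "animal"}
GENPRT = {"nymph", "parr", "larvae", "larva", "egg"}


def annotate(tokens, scientific_names=frozenset()):
    """Return (token, label) pairs; label is None for unannotated tokens.

    `scientific_names` is a hypothetical stand-in for names found by an
    external extractor; the real tool's output format is not shown here.
    """
    labelled = []
    for token in tokens:
        key = token.lower()
        if token in scientific_names:
            label = "SCI"
        elif key in COM:
            label = "COM"
        elif key in GENCOLL:
            label = "GENCOLL"
        elif key in GENPRT:
            label = "GENPRT"
        else:
            label = None
        labelled.append((token, label))
    return labelled


example = annotate(["Salmo", "trutta", "parr", "and", "perch"], {"Salmo", "trutta"})
```

Scientific names are checked first so that a name found by the extractor is never relabelled by the lexicon lookup.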
SKETCH GRAMMAR ADAPTATION
Original: 1. "(JJ.*|N.*[^Z])" MODIFIER{0,3} NOUN{0,2} 1:NOUN NOT_NOUN
Adapted: 2:SCICOMGEN MODIFIER{0,3} NOUN{0,2} 1:SCICOMGEN NOT_NOUN
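To make the adapted rule concrete, here is a rough sketch of what it matches: an annotated collocate (2), up to three modifiers, up to two nouns, an annotated headword (1), then a non-noun token. The expansions of `MODIFIER` and `NOUN` below are assumptions, since the slide does not show the macro definitions, and `modifier_pairs` is a hypothetical helper rather than anything from Sketch Engine.

```python
def is_noun(tag):
    # Assumed expansion of NOUN: any tag starting with "N".
    return tag.startswith("N")


def is_modifier(tag):
    # Assumed expansion of MODIFIER: adjectives, adverbs or nouns.
    return tag.startswith(("JJ", "RB", "N"))


def modifier_pairs(sentence):
    """Find (collocate, headword) pairs in a list of (word, tag, label) tokens.

    `label` is an annotation category or None. The headword must be followed
    by a non-noun token, so a headword at the end of the sentence is skipped.
    """
    pairs = set()
    for i, (word2, _, label2) in enumerate(sentence):
        if label2 is None:
            continue
        for n_mod in range(4):       # MODIFIER{0,3}
            for n_noun in range(3):  # NOUN{0,2}
                j = i + 1 + n_mod + n_noun  # position of the headword
                if j + 1 >= len(sentence):
                    continue
                mods = sentence[i + 1 : i + 1 + n_mod]
                nouns = sentence[i + 1 + n_mod : j]
                word1, _, label1 = sentence[j]
                if label1 is None:
                    continue
                if not all(is_modifier(tag) for _, tag, _ in mods):
                    continue
                if not all(is_noun(tag) for _, tag, _ in nouns):
                    continue
                if is_noun(sentence[j + 1][1]):  # NOT_NOUN
                    continue
                pairs.add((word2, word1))
    return pairs


sentence = [
    ("Atlantic", "NP", None),
    ("salmon", "NN", "COM"),
    ("parr", "NN", "GENPRT"),
    ("were", "VBD", None),
    ("sampled", "VBN", None),
]
pairs = modifier_pairs(sentence)
```

Restricting both labelled positions to annotated tokens is what limits the relation to pairs of species mentions.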
WORD SKETCH RELATIONS INCLUDED
METHODS – TRANSFORMATION INTO GRAPHS
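One way the extracted relations could be turned into a graph, assuming each relation arrives as a (source, relation, target, frequency) record. The record layout, the relation name and the counts below are invented for illustration and are not results from the corpus.

```python
from collections import defaultdict


def build_graph(records):
    """Build a directed graph as {source: {(relation, target): frequency}}.

    Repeated records for the same edge have their frequencies summed.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for source, relation, target, freq in records:
        graph[source][(relation, target)] += freq
    return {node: dict(edges) for node, edges in graph.items()}


# Invented records, for illustration only.
records = [
    ("salmon", "modifies", "parr", 12),
    ("salmon", "modifies", "parr", 3),
    ("trout", "modifies", "egg", 5),
]
graph = build_graph(records)
```

Keeping the frequency on each edge lets the later filtering steps work directly on the graph.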
RESULTS
• Filtering for frequency
• Filtering for salience
FREQUENCY – OVER 20
FREQUENCY – OVER 10
SALIENCE FILTERING
Frequency over 50 Salience over 11
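The two filters can be sketched as simple thresholds on graph edges. Sketch Engine reports logDice as its word sketch score, so the sketch below uses it for salience; that this is the measure behind the thresholds above is an assumption. The edge tuples and counts are invented to show how the two filters can keep different relations.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def filter_edges(edges, min_freq=None, min_salience=None):
    """Keep edges strictly above each threshold that is given."""
    kept = []
    for source, target, f_xy, f_x, f_y in edges:
        if min_freq is not None and f_xy <= min_freq:
            continue
        if min_salience is not None and log_dice(f_xy, f_x, f_y) <= min_salience:
            continue
        kept.append((source, target))
    return kept


# Invented counts: (source, target, co-occurrence freq, source freq, target freq).
edges = [
    ("salmon", "parr", 60, 120, 120),   # frequent and salient
    ("fish", "species", 25, 400, 400),  # frequent, less salient
    ("perch", "egg", 8, 16, 16),        # rare but salient
]
by_frequency = filter_edges(edges, min_freq=20)
by_salience = filter_edges(edges, min_salience=11)
```

With these invented counts, the frequency filter keeps the common but weakly associated pair, while the salience filter keeps the rare but strongly associated one.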
CONCLUSIONS
• Hierarchies identified using adapted Word Sketches
• Frequency and salience filtering highlight different relations
• Disambiguation of common name usage across scientific nomenclature
NEXT STEPS
• Identify specific characteristics of graph nodes for evaluation purposes
• Automatically evaluate this data against existing ontologies
• Further research into the contrasting qualities and uses of frequency versus salience filtering
• Test on corpora with known ambiguities relating to nomenclature usage
THANK YOU!
ANY QUESTIONS?
Please get in touch:
s.h.young@brighton.ac.uk
