Tony Rees: TAXAMATCH poster May 2009
Fuzzy matching of taxon names for biodiversity informatics applications
Acropaginula <> Arcopaginula
Meosarmatium <> Neosarmatium
Peneus <> Penaeus
faveolata <> flaveolata
capricornicus <> capricornensis
abrohlensis <> abrolhensis
Tony Rees, CSIRO Marine and Atmospheric Research, Australia
Taxon scientific names are key identifiers in the world of biodiversity, yet for informatics applications they often fail to provide the required cross linkages on account of minor (or not so minor) differences in spelling arising from keying or phonetic errors, OCR (optical character recognition) and transcription errors, emendations, gender endings of species epithets, differences in diacritical marks, and more.

For example, data on the fish genus Coelorinchus (present “correct” spelling) might be stored under variant spellings Caelorinchus (previously considered correct), Coelorhinchus, Coelorhynchus, Caelorhynchus, and so on, while the potential for random or semi-random keystroke, OCR or transcription errors is almost limitless. If such potential variant spellings cannot be reconciled, some or even all of the desired data may not be retrieved.

This poster introduces TAXAMATCH, a “fuzzy” or near match algorithm developed at CSIRO Marine and Atmospheric Research (Australia), with the specific purpose of providing optimal fuzzy matching for genus and species scientific names in real world situations, and capable of deployment over a remote reference database of spellings deemed correct, or incorporation into any local system to suit a user’s particular needs.

TAXAMATCH reference implementation

The reference installation of TAXAMATCH currently runs over the IRMNG (Interim Register of Marine and Nonmarine Genera) database hosted at CSIRO Marine and Atmospheric Research, available via the access point www.cmar.csiro.au/datacentre/irmng/, which (at mid 2009) contains over 1.4 million species names from the Catalogue of Life and other sources, together with over 400,000 genus names. TAXAMATCH is automatically invoked when single genus + species, or genus queries are made, so as to display not only exact matches but also any near matches in the IRMNG database to any user-supplied input name. Figs. 2 and 3 illustrate how TAXAMATCH will return a match of the correctly spelled name “Homo sapiens” in response to an incorrectly spelled input name “Hombo sapient”. Note that in this instance, operation of the genus and species pre-filters means that only 325 of the 445,004 genera, and 31 of the 1,459,171 species presently in the reference database are actually required to be tested, which contributes significantly to the relatively short execution time for the query (around 1 to a few seconds per input name, or less when conducted without the web interface and ancillary information presented).

The web accessible IRMNG / TAXAMATCH search entry point also currently supports the input of batches of up to approximately 2,500 genus names or 1,200 genus + species names for automated checking, as shown in Fig. 4, and larger batches of names can be checked via alternative mechanisms as desired.

Figure 2: Web accessible IRMNG / TAXAMATCH search entry point www.cmar.csiro.au/datacentre/irmng/

TAXAMATCH use cases

A range of use cases can be envisaged for TAXAMATCH, including the following:
• Matching a (web or other) user’s entered text against stored biodiversity information, where either the input or stored name may be misspelled or a variant spelling
• Checking of names on a “List A” that do not match entries on an equivalent “List B” (but may potentially include the same entities under variant spellings)
• Query expansion for distributed data searches (where all name variants can be indexed in advance), as would be applicable to (e.g.) OBIS, GBIF, etc.
• Deduplication of stored lists, especially those constructed by aggregation of names from multiple sources
• “As you type” spell correction
• Application in taxonomic name recognition software, e.g. via OCR of scanned specimen labels, or detection of taxonomic names in mixed text streams (biological publications, etc.)

TAXAMATCH operating principles

TAXAMATCH comprises a suite of custom filters and tests used in succession on genus, species epithet, plus authority where supplied, to return candidate near or “fuzzy” matches in a reference set of taxon names to any supplied input name. The actual tests employed include the following:
• An exact match test, both before and after minor normalisation
• A phonetic match test, using a custom algorithm “tuned” to the characteristics of taxon scientific names
• A custom “Modified Damerau-Levenshtein Distance” (MDLD) algorithm which looks for possible omitted, inserted, substituted and transposed characters and character blocks
• A modified n-gram comparison of author names and dates where supplied, including expansion of selected known abbreviations of author names as appropriate.

A schematic of overall TAXAMATCH operation is shown in Fig. 1, below.

The custom filtering that has been developed for TAXAMATCH at both genus and species epithet levels comprises:
• Genus and species pre-filters, which serve to speed up the algorithm execution by excluding names deemed to be almost certain not to match from being tested
• Genus and species post-filters, which apply a set of rules to assist in the discrimination of likely “true” from “false” near matches
• A genus cosmetic filter, which presents only a subset of “genus near match” search results to the human web interface, while passing a wide range of genera through to the species stage for further testing
• A final result shaping stage (which can be switched out if desired), which masks more distant near matches in the presence of closer ones, but opens automatically to show them when the latter are absent.
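As a rough illustration of the edit-distance test and the result shaping stage described above, the following Python sketch computes a plain Damerau-Levenshtein distance (the "optimal string alignment" variant) and keeps only the closest tier of near matches. It is not the author's MDLD algorithm, which additionally handles transposed character blocks and is not specified in detail on this poster. The function names `normalise`, `dl_distance` and `near_matches`, the simple lower-casing normalisation, and the threshold `max_distance=2` are all assumptions made for the example.

```python
def normalise(name: str) -> str:
    """Assumed minimal normalisation: lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


def dl_distance(a: str, b: str) -> int:
    """Plain Damerau-Levenshtein (optimal string alignment) distance.

    Counts single-character insertions, deletions, substitutions and
    transpositions of adjacent characters. Not the poster's MDLD.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def near_matches(query, reference, max_distance=2):
    """Return (name, distance) pairs for the closest tier of matches only.

    Keeping just the best distance is a crude stand-in for result shaping:
    more distant candidates appear only when no closer ones exist.
    """
    q = normalise(query)
    scored = [(name, dl_distance(q, normalise(name))) for name in reference]
    scored = [(name, dist) for name, dist in scored if dist <= max_distance]
    if not scored:
        return []
    best = min(dist for _, dist in scored)
    return sorted((name, dist) for name, dist in scored if dist == best)


print(near_matches("Hombo", ["Homo", "Pan"]))  # [('Homo', 1)]
```

Each of the spelling pairs Acropaginula <> Arcopaginula, Peneus <> Penaeus and abrohlensis <> abrolhensis is at distance 1 under this measure (one transposition or one insertion), which is why a small distance threshold already captures many real-world variants.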
Figure 1: Schematic of TAXAMATCH operation. [Flow diagram; box labels recoverable from the original: input genus + species (+ auth.); available genus + species names (+ auth’s); parsing and normalisation; normalised input genus, species and authority; available genus names, genus pre-filter, genus names tested, genus test, genus post-filter, genus near matches; available species, species pre-filter, species tested, species test, species post-filter, species near matches; authorities, auth. comparator; genus cosmetic filter; ranking + result shaping; genus near matches displayed; species near matches displayed.]

Figure 3: Result of above search for the entered term “Hombo sapient” against the IRMNG database

Figure 4: Sample IRMNG search result for a batch of multiple species names to be checked, showing option presented for “fuzzy search” on names that do not have an exact match to any current target name in the IRMNG database at this time.

Conclusion

TAXAMATCH appears to offer a good solution to the problems of near matching genus and / or species scientific names, whether for matching users’ misspelled query terms to correctly stored target data (or vice versa), list cross-matching or internal deduplication, or as a prototype web accessible taxonomic spell checking service. Several development areas for TAXAMATCH are currently under active consideration, and interested potential users or developers are encouraged to contact the author at the address shown below or to visit the TAXAMATCH web page www.cmar.csiro.au/datacentre/taxamatch.htm.
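The list cross-matching use case (names on a “List A” with no exact match on a “List B”) can be sketched in a few lines of Python. The example below uses the standard library's `difflib.get_close_matches` as a generic stand-in for a near-match test; it is not TAXAMATCH and knows nothing about taxon names. The function name `cross_check` and the similarity cutoff of 0.85 are assumptions chosen for illustration.

```python
import difflib


def cross_check(list_a, list_b, cutoff=0.85):
    """For each List A name lacking an exact match on List B,
    suggest up to three near matches from List B.

    difflib's similarity ratio is a generic stand-in here,
    not the TAXAMATCH tests described on this poster.
    """
    exact = set(list_b)
    report = {}
    for name in list_a:
        if name in exact:
            continue  # exact match found, nothing to reconcile
        report[name] = difflib.get_close_matches(
            name, list_b, n=3, cutoff=cutoff
        )
    return report


list_a = ["Coelorinchus", "Penaeus", "Homo"]
list_b = ["Caelorinchus", "Penaeus", "Neosarmatium"]
print(cross_check(list_a, list_b))
```

Names with an empty suggestion list have no plausible counterpart on List B at the chosen cutoff, while names with suggestions are candidates for manual review as variant spellings.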
References
Rees, T. (2008). TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases. TDWG 2008 Annual Conference, Perth, Australia, abstract and presentation available via www.tdwg.org/conference2008/program/.
Rees, T. (2009 in press). TAXAMATCH, an algorithm for near (‘fuzzy’) matching of species scientific names in taxonomic databases. Biodiversity Informatics (submitted).

Acknowledgements
I thank Miroslaw Ryba, CSIRO Marine and Atmospheric Research, for programming and database assistance, and Barbara Boehmer, USA for assistance with modifying her original Oracle® Levenshtein Distance implementation for TAXAMATCH use. Photographs courtesy of Karen Gowlett-Holmes.

contact: Tony Rees
phone: +61 3 6232 5318
email: tony.rees@csiro.au
web: www.cmar.csiro.au/datacentre/
Poster design by Lea Crosswell – Communication Group, CSIRO Marine and Atmospheric Research – May 2009