Tony Rees: TAXAMATCH poster May 2009


TAXAMATCH (fuzzy matching for scientific names of organisms) poster presented at the e-Biosphere conference, London, May 2009


Fuzzy matching of taxon names for biodiversity informatics applications

Acropaginula <> Arcopaginula
Meosarmatium <> Neosarmatium
Peneus <> Penaeus
faveolata <> flaveolata
capricornicus <> capricornensis
abrohlensis <> abrolhensis

Tony Rees, CSIRO Marine and Atmospheric Research, Australia

Taxon scientific names are key identifiers in the world of biodiversity, yet for informatics applications they often fail to provide the required cross linkages on account of minor (or not so minor) differences in spelling arising from keying or phonetic errors, OCR (optical character recognition) and transcription errors, emendations, gender endings of species epithets, differences in diacritical marks, and more.

For example, data on the fish genus Coelorinchus (present “correct” spelling) might be stored under variant spellings Caelorinchus (previously considered correct), Coelorhinchus, Coelorhynchus, Caelorhynchus, and so on, while the potential for random or semi-random keystroke, OCR or transcription errors is almost limitless. If such potential variant spellings cannot be reconciled, some or even all of the desired data may not be retrieved.

This poster introduces TAXAMATCH, a “fuzzy” or near match algorithm developed at CSIRO Marine and Atmospheric Research (Australia), with the specific purpose of providing optimal fuzzy matching for genus and species scientific names in real world situations, and capable of deployment over a remote reference database of spellings deemed correct, or incorporation into any local system to suit a user’s particular needs.

TAXAMATCH operating principles

TAXAMATCH comprises a suite of custom filters and tests used in succession on genus, species epithet, plus authority where supplied, to return candidate near or “fuzzy” matches in a reference set of taxon names to any supplied input name. The actual tests employed include the following:

• An exact match test, both before and after minor normalisation
• A phonetic match test, using a custom algorithm “tuned” to the characteristics of taxon scientific names
• A custom “Modified Damerau-Levenshtein Distance” (MDLD) algorithm which looks for possible omitted, inserted, substituted and transposed characters and character blocks
• A modified n-gram comparison of author names and dates where supplied, including expansion of selected known abbreviations of author names as appropriate.

The custom filtering that has been developed for TAXAMATCH at both genus and species epithet levels comprises:

• Genus and species pre-filters, which serve to speed up the algorithm execution by excluding names deemed to be almost certain not to match from being tested
• Genus and species post-filters, which apply a set of rules to assist in the discrimination of likely “true” from “false” near matches
• A genus cosmetic filter, which presents only a subset of “genus near match” search results to the human web interface, while passing a wide range of genera through to the species stage for further testing
• A final result shaping stage (which can be switched out if desired), which masks more distant near matches in the presence of closer ones, but opens automatically to show them when the latter are absent.

A schematic of overall TAXAMATCH operation is shown in Fig. 1, below.

Figure 1: Schematic of TAXAMATCH operation. Elements shown include: input genus + species (+ auth.); available genus + species names (+ auth’s); parsing and normalisation; genus pre-filter; genus test; genus post-filter; genus cosmetic filter; species pre-filter; species test; species post-filter; authority comparator; ranking + result shaping; genus near matches displayed; species near matches displayed.

TAXAMATCH reference implementation

The reference installation of TAXAMATCH is currently installed over the IRMNG (Interim Register of Marine and Nonmarine Genera) database hosted at CSIRO Marine and Atmospheric Research, available via the access point, which (at mid 2009) contains over 1.4 million species names from the Catalogue of Life and other sources, together with over 400,000 genus names.

TAXAMATCH is automatically invoked when single genus + species, or genus queries are made, so as to display not only exact but also any near matches in the IRMNG database to any user-supplied input name. Figs. 2 and 3 illustrate how TAXAMATCH will return a match of the correctly spelled name “Homo sapiens” in response to an incorrectly spelled input name “Hombo sapient”. Note that in this instance, operation of the genus and species pre-filters means that only 325 of the 445,004 genera, and 31 of the 1,459,171 species presently in the reference database are actually required to be tested, which contributes significantly to the relatively short execution time for the query (around 1 to a few seconds per input name, or less when conducted without the web interface and ancillary information presented).

The web accessible IRMNG / TAXAMATCH search entry point also currently supports the input of batches of up to approximately 2,500 genus names or 1,200 genus + species names for automated checking, as shown in Fig. 4, and checking of larger batches of names can be implemented via alternative mechanisms as desired.

Figure 2: Web accessible IRMNG / TAXAMATCH search entry point

Figure 3: Result of above search for the entered term “Hombo sapient” against the IRMNG database

Figure 4: Sample IRMNG search result for a batch of multiple species names to be checked, showing option presented for “fuzzy search” on names that do not have an exact match to any current target name in the IRMNG database at this time.

TAXAMATCH use cases

A range of use cases can be envisaged for TAXAMATCH, including the following:

• Matching a (web or other) user’s entered text against stored biodiversity information, where either the input or stored name may be misspelled or a variant spelling
• Checking of names on a “List A” that do not match entries on an equivalent “List B” (but may potentially include the same entities under variant spellings)
• Query expansion: for distributed data searches (where all name variants can be indexed in advance), as would be applicable to (e.g.) OBIS, GBIF, etc.
• Deduplication of stored lists, especially those constructed by aggregation of names from multiple sources
• “As you type” spell correction
• Application in taxonomic name recognition software, e.g. via OCR of scanned specimen labels, or detection of taxonomic names in mixed text streams (biological publications, etc.)

Conclusion

TAXAMATCH appears to offer a good solution to the problems of near matching genus and / or species scientific names, whether for matching users’ misspelled query terms to correctly stored target data (or vice versa), list cross-matching or internal deduplication, or as a prototype web accessible taxonomic spell checking service.

Several development areas for TAXAMATCH are currently under active consideration, and interested potential users or developers are encouraged to contact the author at the address shown below or to visit the TAXAMATCH web page.

References

Rees, T. (2008). TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases. TDWG 2008 Annual Conference, Perth, Australia (abstract and presentation available online).

Rees, T. (2009 in press). TAXAMATCH, an algorithm for near (‘fuzzy’) matching of species scientific names in taxonomic databases. Biodiversity Informatics (submitted).

Acknowledgements

I thank Miroslaw Ryba, CSIRO Marine and Atmospheric Research, for programming and database assistance, and Barbara Boehmer, USA, for assistance with modifying her original Oracle® Levenshtein Distance implementation for TAXAMATCH use. Photographs courtesy of Karen Gowlett-Holmes. Poster design by Lea Crosswell, Communication Group, CSIRO Marine and Atmospheric Research, May 2009.

contact: Tony Rees
phone: +61 3 6232 5318
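
To make the edit-distance test, pre-filter and result shaping ideas from the poster more concrete, the following is a minimal Python sketch. It is not the TAXAMATCH code: it uses the standard optimal string alignment variant of Damerau-Levenshtein distance (single-character omissions, insertions, substitutions and adjacent transpositions only, without the character-block handling of MDLD), and the function names, the length-based pre-filter rule and the `max_distance` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (a standard Damerau-Levenshtein
    variant): insert, delete, substitute, adjacent transpose, each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # character omitted
                d[i][j - 1] + 1,         # character inserted
                d[i - 1][j - 1] + cost,  # character substituted (or equal)
            )
            # adjacent transposition, e.g. "cr" <> "rc"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def near_matches(
    name: str,
    reference: List[str],
    max_distance: int = 2,       # hypothetical threshold, not from the poster
    shape_results: bool = True,  # result shaping can be switched off
) -> List[Tuple[int, str]]:
    """Return (distance, reference name) pairs for near matches to `name`."""
    query = name.strip().lower()  # minor normalisation (assumed: trim + lowercase)
    scored = []
    for candidate in reference:
        target = candidate.strip().lower()
        # Hypothetical pre-filter: a length gap larger than max_distance
        # guarantees the edit distance exceeds it, so skip the costly test.
        if abs(len(target) - len(query)) > max_distance:
            continue
        dist = osa_distance(query, target)
        if dist <= max_distance:
            scored.append((dist, candidate))
    scored.sort()
    if shape_results and scored:
        # Mask more distant near matches when closer ones are present.
        best = scored[0][0]
        scored = [item for item in scored if item[0] == best]
    return scored


print(near_matches("Hombo", ["Homo", "Homa", "Penaeus"]))  # [(1, 'Homo')]
```

With `shape_results=False` the same call also returns the more distant candidate "Homa" (distance 2), which mirrors the poster's description of result shaping as an optional final stage. The poster's own pre-filters, phonetic test and post-filter rules are not specified here and are not reproduced by this sketch.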