Fuzzy matching of taxon names for biodiversity informatics applications

Tony Rees, CSIRO Marine and Atmospheric Research, Australia

Acropaginula <> Arcopaginula
Meosarmatium <> Neosarmatium
Peneus <> Penaeus
faveolata <> flaveolata
capricornicus <> capricornensis
abrohlensis <> abrolhensis


Taxon scientific names are key identifiers in the world of biodiversity, yet for informatics applications they often fail to provide the required cross linkages on account of minor (or not so minor) differences in spelling arising from keying or phonetic errors, OCR (optical character recognition) and transcription errors, emendations, gender endings of species epithets, differences in diacritical marks, and more.

For example, data on the fish genus Coelorinchus (present “correct” spelling) might be stored under variant spellings Caelorinchus (previously considered correct), Coelorhinchus, Coelorhynchus, Caelorhynchus, and so on, while the potential for random or semi-random keystroke, OCR or transcription errors is almost limitless. If such potential variant spellings cannot be reconciled, some or even all of the desired data may not be retrieved.

This poster introduces TAXAMATCH, a “fuzzy” or near match algorithm developed at CSIRO Marine and Atmospheric Research (Australia), with the specific purpose of providing optimal fuzzy matching for genus and species scientific names in real world situations, and capable of deployment over a remote reference database of spellings deemed correct, or incorporation into any local system to suit a user’s particular needs.

TAXAMATCH reference implementation

The reference installation of TAXAMATCH is currently deployed over the IRMNG (Interim Register of Marine and Nonmarine Genera) database hosted at CSIRO Marine and Atmospheric Research, available via the access point www.cmar.csiro.au/datacentre/irmng/, which (at mid 2009) contains over 1.4 million species names from the Catalogue of Life and other sources, together with over 400,000 genus names. TAXAMATCH is automatically invoked when single genus + species, or genus, queries are made, so as to display not only exact matches but also any near matches in the IRMNG database to any user-supplied input name. Figs. 2 and 3 illustrate how TAXAMATCH will return a match of the correctly spelled name “Homo sapiens” in response to an incorrectly spelled input name “Hombo sapient”. Note that in this instance, operation of the genus and species pre-filters means that only 325 of the 445,004 genera, and 31 of the 1,459,171 species presently in the reference database are actually required to be tested, which contributes significantly to the relatively short execution time for the query (around 1 to a few seconds per input name, or less when conducted without the web interface and ancillary information presented).

Figure 2: Web accessible IRMNG / TAXAMATCH search entry point www.cmar.csiro.au/datacentre/irmng/

TAXAMATCH use cases

A range of use cases can be envisaged for TAXAMATCH, including the following:

• Matching a (web or other) user’s entered text against stored biodiversity information, where either the input or stored name may be misspelled or a variant spelling
• Checking of names on a “List A” that do not match entries on an equivalent “List B” (but may potentially include the same entities under variant spellings)
• Query expansion – for distributed data searches (where all name variants can be indexed in advance), as would be applicable to (e.g.) OBIS, GBIF, etc.
• Deduplication of stored lists – especially those constructed by aggregation of names from multiple sources
• “As you type” spell correction
• Application in taxonomic name recognition software, e.g. via OCR of scanned specimen labels, or detection of taxonomic names in mixed text streams (biological publications, etc.)

The web accessible IRMNG / TAXAMATCH search entry point also currently supports the input of batches of up to approximately 2,500 genus names or 1,200 genus + species names for automated checking, as shown in Fig. 4, and larger batches of names can be checked via alternative mechanisms as desired.

TAXAMATCH operating principles

TAXAMATCH comprises a suite of custom filters and tests used in succession on genus, species epithet, plus authority where supplied, to return candidate near or “fuzzy” matches in a reference set of taxon names to any supplied input name. The actual tests employed include the following:

• An exact match test, both before and after minor normalisation
• A phonetic match test, using a custom algorithm “tuned” to the characteristics of taxon scientific names
• A custom “Modified Damerau-Levenshtein Distance” (MDLD) algorithm which looks for possible omitted, inserted, substituted and transposed characters and character blocks
• A modified n-gram comparison of author names and dates where supplied, including expansion of selected known abbreviations of author names as appropriate.

The custom filtering that has been developed for TAXAMATCH at both genus and species epithet levels comprises:

• Genus and species pre-filters, which serve to speed up the algorithm execution by excluding names deemed to be almost certain not to match from being tested
• Genus and species post-filters, which apply a set of rules to assist in the discrimination of likely “true” from “false” near matches
• A genus cosmetic filter, which presents only a subset of “genus near match” search results to the human web interface, while passing a wide range of genera through to the species stage for further testing
• A final result shaping stage (which can be switched out if desired), which masks more distant near matches in the presence of closer ones, but opens automatically to show them when the latter are absent.

A schematic of overall TAXAMATCH operation is shown in Fig. 1, below.
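
As a rough illustration of the staged approach described under the operating principles (a cheap pre-filter followed by an edit-distance test), the following Python sketch uses the standard optimal string alignment variant of Damerau-Levenshtein distance. It is not the MDLD algorithm itself, which also handles character blocks. The length-based pre-filter, the function names and the thresholds are assumptions made for this example and are not TAXAMATCH's actual rules.

```python
from typing import Iterable, List, Tuple


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "hl" <-> "lh"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def passes_prefilter(query: str, candidate: str, max_len_diff: int = 2) -> bool:
    """Hypothetical pre-filter (an assumption, not TAXAMATCH's rule):
    skip candidates whose length differs too much from the query."""
    return abs(len(query) - len(candidate)) <= max_len_diff


def near_matches(query: str, reference: Iterable[str],
                 max_distance: int = 2) -> List[Tuple[int, str]]:
    """Return (distance, name) pairs for reference names close to the query,
    closest first. Comparison is case-insensitive."""
    q = query.strip().lower()
    hits = []
    for name in reference:
        n = name.strip().lower()
        if not passes_prefilter(q, n):
            continue  # excluded cheaply, never reaches the distance test
        dist = osa_distance(q, n)
        if dist <= max_distance:
            hits.append((dist, name))
    return sorted(hits)
```

Under these assumptions, `near_matches("Hombo", ["Homo", "Pan", "Penaeus"])` returns `[(1, "Homo")]`, and the poster's example pairs Peneus / Penaeus and abrohlensis / abrolhensis each score a distance of 1. The pair capricornicus / capricornensis lies further apart and would need a looser threshold or additional rules.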


Figure 1: Schematic of TAXAMATCH operation. Stages shown in the schematic: input genus + species (+ authority); parsing and normalisation into normalised input genus, normalised input species and normalised input authority; genus pre-filter, genus test and genus post-filter applied to the available genus names, giving genus near matches; genus cosmetic filter, giving the genus near matches displayed; species pre-filter, species test and species post-filter applied to the available genus + species names (+ authorities), giving species near matches; authority comparator applied to the species authorities; ranking + result shaping, giving the species near matches displayed.

Figure 3: Result of above search for the entered term “Hombo sapient” against the IRMNG database

Figure 4: Sample IRMNG search result for a batch of multiple species names to be checked, showing option presented for “fuzzy search” on names that do not have an exact match to any current target name in the IRMNG database at this time.

Conclusion

TAXAMATCH appears to offer a good solution to the problems of near matching genus and / or species scientific names, whether for matching users’ misspelled query terms to correctly stored target data (or vice versa), list cross-matching or internal deduplication, or as a prototype web accessible taxonomic spell checking service. Several development areas for TAXAMATCH are currently under active consideration, and interested potential users or developers are encouraged to contact the author at the address shown below or to visit the TAXAMATCH web page www.cmar.csiro.au/datacentre/taxamatch.htm.



References

Rees, T. (2008). TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases. TDWG 2008 Annual Conference, Perth, Australia, abstract and presentation available via www.tdwg.org/conference2008/program/.

Rees, T. (2009 in press). TAXAMATCH, an algorithm for near (‘fuzzy’) matching of species scientific names in taxonomic databases. Biodiversity Informatics (submitted).

Acknowledgements

I thank Miroslaw Ryba, CSIRO Marine and Atmospheric Research, for programming and database assistance, and Barbara Boehmer, USA for assistance with modifying her original Oracle® Levenshtein Distance implementation for TAXAMATCH use.

Photographs courtesy of Karen Gowlett-Holmes.

contact: Tony Rees
phone: +61 3 6232 5318
email: tony.rees@csiro.au
web: www.cmar.csiro.au/datacentre/

Poster design by Lea Crosswell – Communication Group, CSIRO Marine and Atmospheric Research – May 2009

Tony Rees: TAXAMATCH poster May 2009

Fuzzy matching of taxon names for biodiversity informatics applications

Acropaginula <> Arcopaginula
Meosarmatium <> Neosarmatium
Peneus <> Penaeus
faveolata <> flaveolata
capricornicus <> capricornensis
abrohlensis <> abrolhensis

Tony Rees, CSIRO Marine and Atmospheric Research, Australia

Taxon scientific names are key identifiers in the world of biodiversity, yet for informatics applications they often fail to provide the required cross linkages on account of minor (or not so minor) differences in spelling arising from keying or phonetic errors, OCR (optical character recognition) and transcription errors, emendations, gender endings of species epithets, differences in diacritical marks, and more.

For example, data on the fish genus Coelorinchus (present “correct” spelling) might be stored under variant spellings Caelorinchus (previously considered correct), Coelorhinchus, Coelorhynchus, Caelorhynchus, and so on, while the potential for random or semi-random keystroke, OCR or transcription errors is almost limitless. If such potential variant spellings cannot be reconciled, some or even all of the desired data may not be retrieved.

This poster introduces TAXAMATCH, a “fuzzy” or near match algorithm developed at CSIRO Marine and Atmospheric Research (Australia), with the specific purpose of providing optimal fuzzy matching for genus and species scientific names in real world situations, and capable of deployment over a remote reference database of spellings deemed correct, or incorporation into any local system to suit a user’s particular needs.

TAXAMATCH operating principles

TAXAMATCH comprises a suite of custom filters and tests used in succession on genus, species epithet, plus authority where supplied, to return candidate near or “fuzzy” matches in a reference set of taxon names to any supplied input name. The actual tests employed include the following:

  • An exact match test, both before and after minor normalisation
  • A phonetic match test, using a custom algorithm “tuned” to the characteristics of taxon scientific names
  • A custom “Modified Damerau-Levenshtein Distance” (MDLD) algorithm which looks for possible omitted, inserted, substituted and transposed characters and character blocks
  • A modified n-gram comparison of author names and dates where supplied, including expansion of selected known abbreviations of author names as appropriate.

The custom filtering that has been developed for TAXAMATCH at both genus and species epithet levels comprises:

  • Genus and species pre-filters, which serve to speed up the algorithm execution by excluding names deemed to be almost certain not to match from being tested
  • Genus and species post-filters, which apply a set of rules to assist in the discrimination of likely “true” from “false” near matches
  • A genus cosmetic filter, which presents only a subset of “genus near match” search results to the human web interface, while passing a wide range of genera through to the species stage for further testing
  • A final result shaping stage (which can be switched out if desired), which masks more distant near matches in the presence of closer ones, but opens automatically to show them when the latter are absent.

A schematic of overall TAXAMATCH operation is shown in Fig. 1, below.

Figure 1: Schematic of TAXAMATCH operation. Stages shown: parsing and normalisation of the input genus + species (+ authority) and of the available genus + species names (+ authorities); genus pre-filter; genus test; genus post-filter; genus cosmetic filter; species pre-filter; species test; species post-filter; authority comparator; ranking + result shaping; genus near matches displayed; species near matches displayed.

TAXAMATCH reference implementation

The reference installation of TAXAMATCH is currently installed over the IRMNG (Interim Register of Marine and Nonmarine Genera) database hosted at CSIRO Marine and Atmospheric Research, available via the access point www.cmar.csiro.au/datacentre/irmng/, which (at mid 2009) contains over 1.4 million species names from the Catalogue of Life and other sources, together with over 400,000 genus names.

TAXAMATCH is automatically invoked when single genus + species, or genus queries are made so as to display not only exact, but also any near matches in the IRMNG database, to any user-supplied input name. Figs. 2 and 3 illustrate how TAXAMATCH will return a match of the correctly spelled name “Homo sapiens” in response to an incorrectly spelled input name “Hombo sapient”. Note that in this instance, operation of the genus and species pre-filters means that only 325 of the 445,004 genera, and 31 of the 1,459,171 species presently in the reference database are actually required to be tested, which contributes significantly to the relatively short execution time for the query (around 1 to a few seconds per input name, or less when conducted without the web interface and ancillary information presented).

The web accessible IRMNG / TAXAMATCH search entry point also currently supports the input of batches of up to approximately 2,500 genus names or 1,200 genus + species names for automated checking, as shown in Fig. 4, and larger batches of names can be checked via alternative mechanisms as desired.

Figure 2: Web accessible IRMNG / TAXAMATCH search entry point www.cmar.csiro.au/datacentre/irmng/

Figure 3: Result of above search for the entered term “Hombo sapient” against the IRMNG database

Figure 4: Sample IRMNG search result for a batch of multiple species names to be checked, showing option presented for “fuzzy search” on names that do not have an exact match to any current target name in the IRMNG database at this time.

TAXAMATCH use cases

A range of use cases can be envisaged for TAXAMATCH, including the following:

  • Matching a (web or other) user’s entered text against stored biodiversity information, where either the input or stored name may be misspelled or a variant spelling
  • Checking of names on a “List A” that do not match entries on an equivalent “List B” (but may potentially include the same entities under variant spellings)
  • Query expansion: for distributed data searches (where all name variants can be indexed in advance), as would be applicable to (e.g.) OBIS, GBIF, etc.
  • Deduplication of stored lists: especially those constructed by aggregation of names from multiple sources
  • “As you type” spell correction
  • Application in taxonomic name recognition software, e.g. via OCR of scanned specimen labels, or detection of taxonomic names in mixed text streams (biological publications, etc.)

Conclusion

TAXAMATCH appears to offer a good solution to the problems of near matching genus and / or species scientific names, whether for matching users’ misspelled query terms to correctly stored target data (or vice versa), list cross-matching or internal deduplication, or as a prototype web accessible taxonomic spell checking service. Several development areas for TAXAMATCH are currently under active consideration, and interested potential users or developers are encouraged to contact the author at the address shown below or to visit the TAXAMATCH web page www.cmar.csiro.au/datacentre/taxamatch.htm.

References

Rees, T. (2008). TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases. TDWG 2008 Annual Conference, Perth, Australia, abstract and presentation available via www.tdwg.org/conference2008/program/.

Rees, T. (2009, in press). TAXAMATCH, an algorithm for near (‘fuzzy’) matching of species scientific names in taxonomic databases. Biodiversity Informatics (submitted).

Acknowledgements

I thank Miroslaw Ryba, CSIRO Marine and Atmospheric Research, for programming and database assistance, and Barbara Boehmer, USA for assistance with modifying her original Oracle® Levenshtein Distance implementation for TAXAMATCH use.

Contact: Tony Rees
phone: +61 3 6232 5318
email: tony.rees@csiro.au
web: www.cmar.csiro.au/datacentre/

Photographs courtesy of Karen Gowlett-Holmes. Poster design by Lea Crosswell, Communication Group, CSIRO Marine and Atmospheric Research, May 2009.
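
Illustrative sketch (not part of the poster)

The poster names a pre-filter, an edit-distance test that allows for omitted, inserted, substituted and transposed characters, and a result shaping stage that masks more distant near matches when closer ones exist. The Python sketch below shows how stages of that general kind could fit together. It is not the TAXAMATCH code. It uses the standard restricted Damerau-Levenshtein (optimal string alignment) distance rather than the author's MDLD, which also handles character blocks. The length-based pre-filter, the `max_dist` threshold, and the function names `osa_distance` and `near_matches` are assumptions made for illustration only. The phonetic test, the post-filters and the authority comparison are omitted.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts single-character insertions, deletions, substitutions and
    transpositions of adjacent characters. This is the textbook algorithm,
    not the poster's MDLD.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # adjacent transposition, e.g. "cr" <> "rc"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def near_matches(query: str, names: list[str], max_dist: int = 2) -> list[tuple[int, str]]:
    """Return (distance, name) pairs for the closest near matches only."""
    q = query.strip().lower()   # minor normalisation (assumed: trim + lowercase)
    scored = []
    for name in names:
        n = name.strip().lower()
        # Hypothetical pre-filter: a length difference above max_dist
        # cannot yield a distance within max_dist, so skip the test.
        if abs(len(n) - len(q)) > max_dist:
            continue
        dist = osa_distance(q, n)
        if dist <= max_dist:
            scored.append((dist, name))
    if not scored:
        return []
    # Result shaping: show only the closest tier; more distant matches
    # appear only when no closer ones are present.
    best = min(dist for dist, _ in scored)
    return sorted(pair for pair in scored if pair[0] == best)
```

With the made-up candidate list `["Homo", "Homa", "Pan"]` (test strings only, not checked against IRMNG), the input "Hombo" returns only "Homo" at distance 1. If "Homo" is removed from the list, the more distant "Homa" at distance 2 is shown instead, which mirrors the masking behaviour the poster describes for the result shaping stage.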