Biodiversity informatics fills a space between traditional bioinformatics, with its focus on genomes, and ecoinformatics, which looks at entire landscapes and their interaction with the physical world. Biodiversity informatics focuses on taxa and their interactions with one another.
Nomenclature and taxonomy play a central role in the handling of biodiversity information because nearly every piece of information or data related to a species (or, more specifically, a taxon) is labeled with a scientific name.
GBIF has a specific focus within biodiversity information in that our scope is restricted to the mobilisation, discovery, and use of primary biodiversity data. Primary biodiversity data are the digital text or multimedia data records that detail the instance of an organism – the ‘what, where, when, how and by whom’ of the organism’s occurrence and recording. One major class of primary biodiversity data is that derived from natural history collections.
A second class of primary biodiversity data originates with observations of species, and there are numerous observational data networks that collect millions of species observations every year.
These different classes of biodiversity information are typically stored in databases of some sort that are hosted throughout the world. These databases may contribute to larger networks or act as standalone data access systems. In most cases, the data are made accessible over the Internet through a variety of gateways or portals.
GBIF represents a federated network that is composed of thousands of different primary biodiversity databases located all over the world.
Two things make all of these different databases part of the GBIF network. First, the data are made available on the Internet using a common set of communications protocols and data formats. Second, a registry, which lists all members of the network and the location of the data itself (often a URL), serves as a master network directory.
The registry and communications protocols are utilised to poll each database in the network and retrieve an index of the biodiversity data records they contain. The index includes the key taxonomic, geospatial, and provenance elements of the data record. This allows the data to be visually represented, for instance, on a map of the Earth.
Currently the GBIF index stands at over 310 million records from over 9000 different databases. Each of these data records carries the name of the taxon, usually a species, that it is associated with. The total number of scientific names in this virtual dataset exceeds 6 million different text strings, far more than the number of known species. Correctly interpreting this list of names is a key requirement in enabling effective use of the index.
This graph shows the growth of the GBIF occurrence index since 2007.
Before I describe the challenges inherent to the index, I’d like to illustrate how biodiversity data has been used in various scientific and biodiversity policy-related contexts.
In this example, occurrence data from the GBIF network has been geospatially joined with world protected area boundaries to generate provisional species lists and data distribution summaries for the protected area.
Occurrence data has been combined with IUCN species range maps both to validate the distribution and identify potential gaps in coverage.
Species occurrence data is geospatially integrated with additional data types such as climatic data to create an ecological profile for the species. AquaMaps uses ecological niche modeling to predict the distribution of marine species.
In the example illustrated here, the model outputs project changes in distribution of a crop species based on possible climate change scenarios.
Researchers at Lancaster University have utilised GBIF data mining tools and the occurrence index to extract over 65,000 species names from the US and World patent indices and determine the distribution of these species among the world's nations, in order to inform the Access and Benefit Sharing processes demanded by developing countries as a component of the Convention on Biological Diversity.
The uses illustrated here require access to primary biodiversity data that is organised around taxa, either species or higher groups like families. This organisation is challenged by a number of different factors which I would like to illustrate.
In a federated data environment, specimens may be labeled with different names that refer to the same species. Here is an example of a pair of nomenclatural synonyms that are initially interpreted as distinct taxa and subsequently result in distinct occurrence data maps.
Access to authoritative synonymised species checklists, when properly annotated and interpreted, enables data records labeled with different names to be linked to the same taxon. This clearly impacts the resultant data distribution output and any subsequent uses of these data. A challenge for GBIF has been gaining access to taxonomic authority files. Until recently the only major taxonomic data source was the Catalogue of Life, a wonderful resource but one that only partially addressed this problem within the GBIF index.
Edward Dickinson mentioned the problem with synonymy in birds and their compilations being scattered among a range of resources. A consequence of this is illustrated here: while the Catalogue of Life provides the correct name for the blue tit, it does not include the original combination of the name coined by Linnaeus and, as a consequence, misses the majority of occurrences in the index.
Without access to sufficient authoritative taxonomic data, we have been forced to rely on less-accurate classification data originating in occurrence datasets. These datasets often contain errors such as the one illustrated here, where a synonym of a European bird species was mistakenly placed in the hummingbird family. This creates knock-on effects that impact use beyond the single species to the entire family.
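To make the blue tit case concrete, here is a minimal sketch of how a synonymised checklist links records labeled with different names to one taxon. The checklist dictionary, the function name and the occurrence records are all invented for illustration; this is not the GBIF implementation.

```python
from collections import defaultdict

# Hypothetical synonymised checklist: every known name maps to its accepted name.
# Parus caeruleus is the original combination coined by Linnaeus for the blue tit.
CHECKLIST = {
    "Cyanistes caeruleus": "Cyanistes caeruleus",
    "Parus caeruleus": "Cyanistes caeruleus",
}

def group_by_accepted(records, checklist):
    """Group occurrence localities under the accepted name.

    Names missing from the checklist are kept under the name on the label.
    """
    grouped = defaultdict(list)
    for name, locality in records:
        accepted = checklist.get(name, name)
        grouped[accepted].append(locality)
    return dict(grouped)

# Invented occurrence records: (name on the label, locality).
records = [
    ("Parus caeruleus", "Denmark"),
    ("Cyanistes caeruleus", "Sweden"),
    ("Parus caeruleus", "Spain"),
]

merged = group_by_accepted(records, CHECKLIST)
unmerged = group_by_accepted(records, {})
print(merged)
print(unmerged)
```

With the checklist, all three records fall under one taxon; with an empty checklist, the same records split into two apparent taxa and two separate maps.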
With access to a more complete array of authoritative taxonomic sources, we are able to match more taxa and improve the taxonomic backbone used to organise and present species data records.
The lack of a comprehensive multi-regnal nomenclator means that we have no clear indication of the number of homonyms that exist, nor a method for determining which classification is ‘correct’. As a result the GBIF index may provide a confusing array of options for a user. Illustrated above is a typical case where we have a number of different Oenanthe but lack sufficient external taxonomic resources to reconcile this number any further.
Access to a wider array of nomenclatural sources reveals there are exactly two genera with this name and includes a common name to help distinguish them.
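As a rough sketch of the idea, homonyms can be found by grouping nomenclator entries by name and keeping the names that occur more than once. The entry list and function name below are assumptions made for illustration; Oenanthe is used because it names both a bird genus and a plant genus.

```python
from collections import defaultdict

# Hypothetical nomenclator entries: (genus, kingdom, common name).
ENTRIES = [
    ("Oenanthe", "Animalia", "wheatears"),
    ("Oenanthe", "Plantae", "water dropworts"),
    ("Cyanistes", "Animalia", "blue tits"),
]

def find_homonyms(entries):
    """Return the genus names that appear more than once, with their contexts."""
    by_name = defaultdict(list)
    for genus, kingdom, common_name in entries:
        by_name[genus].append((kingdom, common_name))
    return {genus: uses for genus, uses in by_name.items() if len(uses) > 1}

homonyms = find_homonyms(ENTRIES)
for genus, uses in homonyms.items():
    for kingdom, common_name in uses:
        print(f"{genus} ({kingdom}: {common_name})")
```

The kingdom and common name are what let a user tell the two genera apart when both are offered as options.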
Difficulties with orthography in scientific names start at the source. Here are some examples of insect specimen labels that have been transcribed to electronic databases.
It may come as no surprise, therefore, to see the sort of variation that may exist in a federated dataset for some of the more complex scientific names. Considerable work has gone into the development of ‘fuzzy matching’ algorithms, notably Tony Rees’ TaxaMatch. But it’s only authoritative nomenclatural sources that can inform us which is the correctly spelled version of the name.
Reconciling orthography and nomenclature presents problems beyond simple misspellings. Nomenclatural formats include authorship, infraspecific ranks, and other notation. For a computer, all of these strings represent different names and present challenges to properly organising data records in a federated environment.
Taxonomic name parsing services provide a solution for matching different forms of the same name whenever biodiversity data needs to be integrated from multiple sources. The service atomises a name into recognisable constituent parts and reassembles a simplified canonical form that will be equivalent for the different versions of the name.
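The relationship between fuzzy matching and an authoritative source can be sketched as follows. This is not TaxaMatch; it uses the generic string similarity in Python's standard `difflib` module as a stand-in, and the authority list, function name and cutoff value are assumptions chosen for illustration.

```python
import difflib

# Hypothetical authoritative list of correctly spelled names.
AUTHORITATIVE = ["Agalinis paupercula", "Agalinis purpurea", "Oenanthe oenanthe"]

def suggest_spelling(name, authority, cutoff=0.8):
    """Return the closest authoritative name, or None if nothing is similar enough.

    The similarity measure only ranks candidates; the authority list is what
    decides which spelling counts as correct.
    """
    matches = difflib.get_close_matches(name, authority, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest_spelling("Agalinus pauperculum", AUTHORITATIVE))
print(suggest_spelling("Homo sapiens", AUTHORITATIVE))
```

A misspelled variant is mapped to the authoritative spelling, while a name with no close counterpart in the list is left unresolved rather than forced onto a wrong match.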
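A much simplified sketch of such a parser is shown below. It assumes that the first token is the genus, that epithets are lowercase words, that authors are capitalised or parenthesised, and that only a few rank markers occur. The function name and marker set are illustrative; a production parser handles many more cases (hybrids, subgenera, abbreviated authors and so on).

```python
import re

# Rank markers recognised by this sketch (an assumption, not a complete list).
RANK_MARKERS = {"var.", "subsp.", "ssp."}

def canonical_form(name):
    """Reduce a scientific name string to genus plus epithets."""
    # Drop parenthesised author citations such as "(A.Gray)".
    stripped = re.sub(r"\([^)]*\)", " ", name)
    tokens = stripped.split()
    genus, rest = tokens[0], tokens[1:]
    epithets = []
    after_rank = False
    for token in rest:
        if token in RANK_MARKERS:
            after_rank = True
            continue
        # Keep lowercase words as epithets; the word directly after a rank
        # marker is also an epithet even if it was capitalised by mistake.
        if after_rank or re.fullmatch(r"[a-z-]+", token):
            epithets.append(token.lower())
        after_rank = False
    return " ".join([genus] + epithets)

variants = [
    "Agalinus paupercula borealis",
    "Agalinus paupercula var. borealis Pennell",
    "Agalinus paupercula Britton var. borealis Pennell",
    "Agalinus paupercula (Gray) Britton var. borealis (Pennell) Zenkert 1934",
]
canonical = {canonical_form(v) for v in variants}
print(canonical)
```

The four strings collapse to a single canonical form. Note that parsing does not repair the genus spelling (Agalinus versus Agalinis); that still needs an authoritative source.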
These name parsers – combined with authoritative nomenclatural data – extend the utility of this service by providing the raw materials for creating specialised taxonomic name dictionaries.
These dictionaries, combined with software, result in name-mining services that can locate scientific names in literature, on specimen labels, and in other full-text publications. They can rapidly and accurately extract all scientific names from large compilations of literature. Such services are employed by the BHL to develop taxonomic indices and by the CBD data mining example I cited earlier.
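The dictionary-plus-software idea can be sketched in a few lines. The names dictionary, the candidate pattern and the sample text are assumptions for illustration only; real name-finding services cope with abbreviated genera, line breaks, OCR errors and names below or above species rank.

```python
import re

# Hypothetical names dictionary of canonical binomials.
NAME_DICTIONARY = {"Cyanistes caeruleus", "Oenanthe oenanthe", "Agalinis paupercula"}

# Candidate binomial: a capitalised word followed by a lowercase word.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([a-z]+)\b")

def find_names(text, dictionary):
    """Return dictionary names found in the text, in order of first appearance."""
    found = []
    for match in CANDIDATE.finditer(text):
        candidate = f"{match.group(1)} {match.group(2)}"
        if candidate in dictionary and candidate not in found:
            found.append(candidate)
    return found

text = (
    "The blue tit, Cyanistes caeruleus, was recorded near the reed bed. "
    "Several Oenanthe oenanthe were seen on the same day."
)
names = find_names(text, NAME_DICTIONARY)
print(names)
```

The pattern proposes candidates such as "The blue", and the dictionary is what rejects them, which is why the quality of the dictionary determines the quality of the extraction.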
How do we facilitate this?
At GBIF we are working today on extending our architectural framework to serve as a contributor to a Global Names Architecture. This is a framework that supports the discovery of, and access to, a range of nomenclatural and taxonomic resources. It should enable the development of new integrated resources, such as a consolidated nomenclatural index that can serve as a core authoritative names dictionary to which different taxonomies may be tied. And it should promote the development of name services that enable taxonomy to serve as the core organisational framework for all biodiversity information. Thank you.
Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond
Biodiversity Informatics: GBIF's role in linking information through scientific names
David Remsen, Senior Programme Officer, Global Biodiversity Information Facility (GBIF)
28 October 2011
Issues of Orthography: Reconciling different forms of the same name
Agalinus paupercula borealis
Agalinus pauperculum borealis
Agalinis paupercula var. Borealis
Agalinus pauperculum var. borealis
Agalinus paupercula var. borealis
Agalinus paupercula var. borealis Pennell
Agalinus paupercula Britton var. borealis Pennell
Agalinus paupercula (Gray) Britt. var. borealis Pennell
Agalinis paupercula (A.Gray) Britton var. borealis Pennell
Agalinus paupercula (Gray) Britton var. borealis (Pennell) Zenkert 1934