Tdwg 2-remsen


Presentation at TDWG Conference 2011 in New Orleans

Published in: Technology
  • To start with, GBIF strives to create a global biodiversity data network that facilitates free and open access to primary biodiversity data worldwide. Currently, the network includes over 9200 datasets from over 340 data publishers representing over 100 countries and international organisations. Collectively the network provides access to over 300 million data records.
  • The foundation of the GBIF data network has historically been based on access to biodiversity databases mediated through one of the TDWG protocols listed above. These different protocols support the means to query databases in a standard manner and receive data results formatted according to Darwin Core or ABCD XML specifications.
  • These protocols were designed to support a fully federated network where a user could query the network through a gateway, which would propagate the query to all the members of the network, assemble the responses and return them to the user.
  • The GBIF network, however, was never able to function in this federated role. Real-time querying of databases was hampered by many factors, not the least of which was that at any given time up to ¼ of the data servers were offline.
  • As a result the GBIF data portals provide discovery of data through a central index. This index consists of a subset of all the data served through the network that can be used to answer the key questions related to the data store – what species are included, where were they found and when were they collected.
  • DiGIR, TAPIR and BioCASe are not well suited for building indexes of databases. They require long iterations of queries to harvest an entire dataset. A dataset of 260,000 specimens served via TAPIR, which allows 200 records to be retrieved per request, requires 1300 request/response pairs and takes over 9 hours to complete. During this time 500 MB of XML data is transferred. Once processed on the GBIF server, this becomes a 32 MB text file, which could be further compressed to a 3 MB zip file. Producing such a data export and zipping it would take under a minute if done by the database itself. Thus in 2009, GBIF began to promote the use of a new indexing data format.
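The figures in this note can be checked with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and because the note says "over 9 hours", the per-request time it prints is a lower bound.

```python
import math


def harvest_cost(total_records: int, page_size: int, total_hours: float):
    """Return (number of requests, average seconds per request)."""
    requests = math.ceil(total_records / page_size)
    seconds_per_request = total_hours * 3600 / requests
    return requests, seconds_per_request


def size_ratio(larger_mb: float, smaller_mb: float) -> float:
    """How many times larger one representation is than another."""
    return larger_mb / smaller_mb


# Figures quoted in the note: 260,000 records, 200 per request, 9+ hours.
requests, secs = harvest_cost(260_000, 200, 9.0)
print(requests)            # 1300
print(round(secs, 1))      # 24.9 seconds per request, at least

# 500 MB of XML versus the 32 MB text file holding the same records.
print(size_ratio(500, 32))  # 15.625
```

Most of the cost is per-request overhead and XML markup rather than the data themselves, which is why a single database export is so much cheaper.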
  • Darwin Core Archives provide Darwin Core-based occurrence and taxonomic data in a simple, text-based format. They simplify the exchange of indexes by eliminating the use of federated transfer protocols. Data are accessed via a simple URL using HTTP.
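To give a feel for how little machinery a zipped, text-based export needs, here is a minimal Python sketch that reads a tab-delimited data file out of a zip archive held in memory. It is an assumption-laden illustration: the file name `occurrence.txt`, the helper names and the two sample rows are made up, and a real Darwin Core Archive also carries a `meta.xml` descriptor that this sketch ignores.

```python
import csv
import io
import zipfile


def read_core_rows(archive_bytes: bytes, data_file: str = "occurrence.txt"):
    """Yield one dict per record from a tab-delimited file inside a zip."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        with archive.open(data_file) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text, delimiter="\t")


def build_sample_archive() -> bytes:
    """Build a tiny stand-in archive with invented rows (not real data)."""
    table = (
        "occurrenceID\tscientificName\tcountryCode\n"
        "1\tOenanthe oenanthe\tTH\n"
        "2\tPuma concolor\tUS\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", table)
    return buffer.getvalue()


# In practice the bytes would come from a single HTTP GET of the archive URL.
rows = list(read_core_rows(build_sample_archive()))
thai = [row["scientificName"] for row in rows if row["countryCode"] == "TH"]
print(thai)  # ['Oenanthe oenanthe']
```

The whole dataset arrives in one response, so answering "any data records from Thailand?" becomes a local filter over the indexed rows.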
  • Darwin Core Archives give GBIF the means to reduce the delay, currently a month or more, between when a data publisher registers data and when those data appear in the data portal. We anticipate that with increased uptake of Darwin Core Archive and improvements in our data integration processes, we can reduce this latency from approximately a month to a week or less. In addition, Darwin Core Archive has enabled us to index very large datasets that simply could not be harvested using the federated protocols.
  • Thus, since the Darwin Core Archive standard has been adopted, GBIF has seen a significant increase in the numbers of data records published through the network with a 50% increase in 2011 alone.
  • A second significant challenge to delivering biodiversity data effectively in a federated network is the quality of the geospatial properties of records.
  • This map shows raw data, as harvested from data providers, that are asserted to originate in the United States. Note the mirror image of the United States over India and China. This is due to a missing minus sign in the longitude value.
  • This is what the data look like after improved interpretation methods have been applied. We can now recognise international waters and offshore islands.
  • Providing taxonomic access to biodiversity data is a key requirement for many users. Both Darwin Core and ABCD provide the means for data publishers to include the Linnaean classification of the referenced species within the data record. In a federated network, the result is that the same taxon may be classified in different ways. Not only does this complicate assembling a common taxonomic backbone for organising indexed data, it also complicates distinguishing actual homonyms, cases where the same name has been applied to two different taxa. In addition, scientific names are often misspelled, and even a correctly spelled name may exist in many different orthographies.
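The mirror-image error described here can be detected mechanically. The Python sketch below is a simplified stand-in for that kind of check: it uses a rough bounding box for the contiguous United States (assumed values) rather than national shapefiles, and the function names and error labels are hypothetical.

```python
# Rough bounding box for the contiguous United States (assumed values,
# a crude stand-in for a national shapefile).
US_BOX = {"min_lat": 24.0, "max_lat": 50.0, "min_lon": -125.0, "max_lon": -66.0}


def in_box(lat: float, lon: float, box: dict) -> bool:
    """True if the point lies inside the bounding box."""
    return (box["min_lat"] <= lat <= box["max_lat"]
            and box["min_lon"] <= lon <= box["max_lon"])


def diagnose(lat: float, lon: float, box: dict = US_BOX) -> str:
    """Classify a coordinate pair that is asserted to lie inside the box."""
    if in_box(lat, lon, box):
        return "ok"
    if in_box(lat, -lon, box):
        return "longitude sign flipped"
    if in_box(-lat, lon, box):
        return "latitude sign flipped"
    if in_box(lon, lat, box):
        return "coordinates inverted"
    return "outside country"


print(diagnose(38.9, -77.0))   # ok
print(diagnose(38.9, 77.0))    # longitude sign flipped (lands in Asia)
print(diagnose(-77.0, 38.9))   # coordinates inverted
print(diagnose(48.85, 2.35))   # outside country
```

Trying each simple transformation in turn both flags the outlier and qualifies the nature of the error, so the record can be marked rather than silently dropped.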
  • GBIF assembles a taxonomic backbone from taxonomic sources that are more authoritative than the classifications included with collections data. These sources are derived from new capacities within the GBIF network that enable species information to be published through the GBIF network in the same manner as collections (species occurrence) data. The GBIF taxonomic backbone, once assembled from a mix of both authoritative and collections-based classifications, is now composed entirely of published taxonomic catalogue data.
  • An example of how this impacts data organisation and delivery is illustrated in the map above. A European bird species with a name not occurring in the Catalogue of Life was mistakenly placed within the hummingbirds (a New World group) based on classification information tied to some of the specimens. This resulted in the map above, where one erroneous species grouping affects the map for the entire family.
  • With access to a wider array of authoritative taxonomic sources, we are able to match more taxa using more reliable sources and improve the taxonomic backbone used to organise all species data records.
  • This improved taxonomic reconciliation extends to the resolution of homonyms – names for different taxa that are spelled alike. Relying solely on taxonomic information within occurrence data sources provides a confusing array of possible homonyms. Relying on taxonomic authority files reveals there are exactly two genera with this name and includes a common name to help distinguish them.
  • Lastly, informatics improvements complement the addition of authoritative taxonomic sources by providing better methods for matching names to authority files. GBIF's name parsing service parses names into recognised component parts and builds canonical representations of names that allow different forms of the same name to be matched to authority file information.
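As a rough illustration of what "building a canonical representation" means, the Python sketch below strips authorship and normalises case and spacing so that different forms of a name compare equal. It is a deliberately naive, hypothetical helper, not GBIF's name parser: it would mistake lower-case author particles such as "de" or "ex" for epithets and does not handle rank markers or hybrids.

```python
import re

# A lower-case word (hyphens allowed) is treated as an epithet.
_EPITHET = re.compile(r"^[a-z][a-z-]+$")


def canonical_name(verbatim: str) -> str:
    """Return genus plus any following epithets, dropping authorship."""
    tokens = verbatim.replace(",", " ").split()
    if not tokens:
        return ""
    parts = [tokens[0].capitalize()]
    for token in tokens[1:]:
        # Stop at the first token that looks like authorship or a year.
        if not _EPITHET.match(token):
            break
        parts.append(token)
    return " ".join(parts)


print(canonical_name("Oenanthe oenanthe (Linnaeus, 1758)"))  # Oenanthe oenanthe
print(canonical_name("  oenanthe   oenanthe L."))            # Oenanthe oenanthe
print(canonical_name("Oenanthe crocata L."))                 # Oenanthe crocata
print(canonical_name("Oenanthe L."))                         # Oenanthe
```

Once names are reduced to a common form like this, they can be used as lookup keys against an authority file; telling the plant genus from the bird genus still requires the classification held in that file.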
    1. Taxonomic Databases Working Group Annual Meeting 2011. GBIF: Issues in providing federated access to digital information related to biological specimens. David Remsen, Senior Programme Officer, Global Biodiversity Information Facility (GBIF). TDWG 2011
    2. Issue #1: The consequences of scale
        • Goal: Provide timely access to a large federated network of biodiversity databases
    3. About GBIF. The mission of the Global Biodiversity Information Facility (GBIF) is to facilitate free and open access to biodiversity data worldwide via the Internet to underpin sustainable development. Primary biodiversity data:
        • 341 publishers
        • 9290 datasets
        • 310M records
        • 57 countries
        • 45 organisations
    4. “Wrapper” Software: PyWrapper (Python); TAPIR Link (PHP); DiGIR (PHP). Your database: Insect Collection; Bird Observations; Herbarium Data. Install one of these ‘wrappers’. ABCD; DarwinCore; DarwinCore
    5. The promise of federation. Insect Collection; Herbarium; Bird Observations; Herbarium. “Any specimens from Thailand?” GBIF Data Portal: “I will ask!” “I do!” “I do!” “I do!” “Nope!” GBIF Data Portal as a Gateway
    6. The challenge of federation. Insect Collection; Herbarium; Bird Observations; Herbarium. “Hello?” “Server Not Available” GBIF Data Portal: “Hi!”
    7. The rise of Indexing. Insect Collection; Herbarium; Bird Observations; Herbarium. “Any data records from Thailand?” “Send me an index of all of your data” GBIF Data Portal (now with Data!). GBIF Data Portal as a Data Index
    8. The wrong tools for the job. Insect Collection; Herbarium; Bird Observations; Herbarium. “Any data records from Thailand?” “Send me an index of your data once per month” “Here is page one. If I go offline, start again” “Not too fast!” “You ask the same questions every time” GBIF Data Portal (now with Data!)
    9. Darwin Core Archives: A text-based solution to publishing biodiversity data
    10. A Refined Approach. Insect Collection; Herbarium; Bird Observations; Herbarium. “Any data records from Thailand?” “This is fast!” GBIF Data Portal (now with Data!). URL; URL; URL; URL. “This is easy”
    11. Growth. 2007: 70 million; 2008: 147 million; 2009: 180 million; 2010: 201 million; Today: 302 million. Annotation: Need for a new standard identified
    12. Issue #2: Geospatial Integration
        • Goal: Provide accurate reporting of nationally-bound data
        • Challenge: Inaccurate recording of geospatial coordinates
    13. Geo-referenced USA data: Verbatim data as shared on the network
    14. Issue #2: Geospatial Integration
        • Remediation includes
        • Integration of national shapefiles to verify that coordinates fell within country boundaries
            • Including EEZ boundaries
            • Including islands
        • Identified outliers
        • Qualified the nature of the error (e.g., “coordinates inverted”)
        • Marked and omitted these records from display
    15. Geo-referenced USA data: Data following interpretation
            • Coastal regions recognised
            • Offshore islands recognised
    16. Issue #3: Taxonomic Integration
        • Goal: Provide access to biodiversity data according to taxonomic groups and concepts
        • Challenge:
            • Heterogeneous and sometimes inaccurate classification
                • Same taxon appearing in different classifications
            • Presence of homonyms that complicate reconciling above
            • Misspellings
            • Wide range of orthographies for the same name.
    17. Enabled taxonomic data to be published through GBIF
    18. Trochilidae (Hummingbirds) (today): Misinterpretations (Hummingbirds are only found in western hemisphere)
    19. Trochilidae (Hummingbirds) (next month): Improved interpretation
    20. Search for Oenanthe (water dropwort plant or wheatear bird). Today: Difficult for user to interpret. Next month: Accurate search results
    21. Improved the means to match names
    22. In summary
        • GBIF has had to deploy different data access strategies in order to effectively scale
        • Darwin Core Archive offers a scalable solution that has led to rapid growth in data published through GBIF
        • Geospatial filtering via shapefiles provides basis for more accurate national reporting
            • Basis for additional services later (e.g., ecosystem shapefiles, protected areas, etc.)
        • Heterogeneous taxonomy inherent to collections data is nearly impossible to consolidate into a taxonomically accurate structure.
            • Comprehensive authoritative taxonomic data is a key organisational component of collections data
    23. Thank you