GBIF Annual Meeting 2011: Issues in Providing Federated Access to Digital Biodiversity Data

David Remsen

Presentation at TDWG Conference 2011 in New Orleans

Taxonomic Databases Working Group Annual Meeting 2011 GBIF: Issues in providing federated access to digital information related to biological specimens. David Remsen Senior Programme Officer Global Biodiversity Information Facility (GBIF) TDWG 2011

Issue #1: The consequences of scale

About GBIF The mission of the Global Biodiversity Information Facility (GBIF) is to facilitate free and open access to biodiversity data worldwide via the Internet to underpin sustainable development. Primary biodiversity data

“Wrapper” Software PyWrapper (Python) TAPIR Link (PHP) DiGIR (PHP) Your database Insect Collection Install one of these ‘wrappers’ ABCD Bird Observations Herbarium Data DarwinCore DarwinCore

The promise of federation Insect Collection Herbarium Bird Observations Herbarium Any specimens from Thailand? GBIF Data Portal I will ask! I do! I do! I do! Nope! GBIF Data Portal as a Gateway

The challenge of federation Insect Collection Herbarium Bird Observations Herbarium Hello? Server Not Available GBIF Data Portal Hi!

The rise of Indexing Insect Collection Herbarium Bird Observations Herbarium Any data records from Thailand? Send me an index of all of your data GBIF Data Portal (now with Data!) GBIF Data Portal as a Data Index

The wrong tools for the job Insect Collection Herbarium Bird Observations Herbarium Any data records from Thailand? Send me an index of your data once per month Here is page one. If I go offline, start again Not too fast! You ask the same questions every time GBIF Data Portal (now with Data!)

Darwin Core Archives A text-based solution to publishing biodiversity data

A Refined Approach Insect Collection Herbarium Bird Observations Herbarium Any data records from Thailand? This is fast! GBIF Data Portal (now with Data!) URL URL URL URL This is easy

Growth (chart; axis labels: 2007, 2008, 2009, 2010, Today; data labels: 70 million, 147 million, 180 million, 201 million, 302 million) Need for a new standard identified

Issue #2: Geospatial Integration

Geo-referenced USA data Verbatim data as shared on the network

Issue #2: Geospatial Integration

Geo-referenced USA data Data following interpretation

Issue #3: Taxonomic Integration

Enabled taxonomic data to be published through GBIF

Trochilidae (Hummingbirds) (today) Misinterpretations (Hummingbirds are only found in western hemisphere)

Trochilidae (Hummingbirds) (next month) Improved interpretation

Search for Oenanthe (water dropwort plant or wheatear bird) Difficult for user to interpret Accurate search results Today Next month

In summary

Improved the means to match names

Thank you

Editor's Notes

To start with, GBIF strives to create a global biodiversity data network that facilitates free and open access to primary biodiversity data worldwide. Currently, the network includes over 9200 datasets from over 340 data publishers representing over 100 countries and international organisations. Collectively the network provides access to over 300 million data records.
The foundation of the GBIF data network has historically been based on access to biodiversity databases mediated through one of the TDWG protocols listed above. These different protocols support the means to query databases in a standard manner and receive data results formatted according to Darwin Core or ABCD XML specifications.
These protocols were designed to support a fully federated network where a user could query the network through a gateway, which would propagate the query to all the members of the network and assemble the resultant responses to the user.
The GBIF network, however, was never able to function in this federated role. Real-time querying of databases was hampered by many factors, not the least of which was that at any given time up to a quarter of the data servers were offline.
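The gateway behaviour described in the two notes above can be sketched in a few lines. This is only an illustration under stated assumptions: the provider names, their holdings and the `ServerNotAvailable` exception are all hypothetical, and no real TDWG protocol is involved. It shows why a single offline server leaves a real-time federated answer incomplete.

```python
# Hypothetical providers and the countries they hold records from (invented data).
PROVIDERS = {
    "insect-collection": ["Thailand", "Laos"],
    "herbarium-a": ["Thailand"],
    "bird-observations": ["Thailand", "Kenya"],
    "herbarium-b": ["Denmark"],
}


class ServerNotAvailable(Exception):
    """Raised when a (simulated) provider cannot be reached."""


def query_provider(name, country, offline):
    """Ask one provider whether it holds records from the given country."""
    if name in offline:
        raise ServerNotAvailable(name)
    return country in PROVIDERS[name]


def federated_query(country, offline=frozenset()):
    """Propagate the query to every provider and assemble the responses."""
    hits, unreachable = [], []
    for name in sorted(PROVIDERS):
        try:
            if query_provider(name, country, offline):
                hits.append(name)
        except ServerNotAvailable:
            unreachable.append(name)
    return hits, unreachable


if __name__ == "__main__":
    print(federated_query("Thailand"))
    # With one provider offline, its Thai records silently drop out of the answer.
    print(federated_query("Thailand", offline={"herbarium-a"}))
```

A central index avoids this problem because the answer no longer depends on every provider being online at query time.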
As a result the GBIF data portals provide discovery of data through a central index. This index consists of a subset of all the data served through the network that can be used to answer the key questions related to the data store – what species are included, where were they found and when were they collected.
DiGIR, TAPIR and BioCASe are not well suited to building indexes of databases. They require long iterations of queries to harvest an entire dataset. For example, a dataset of 260,000 specimens served via TAPIR allows 200 records to be retrieved per request. Harvesting it requires 1,300 request/response pairs and takes over 9 hours to complete. During this time 500 MB of XML data is transferred. Once processed on the GBIF server, this becomes a 32 MB text file, which could be further compressed to a 3 MB zip file. Producing such a data export and zipping it would take under a minute if done by the database itself. Thus in 2009, GBIF began to promote the use of a new indexing data format.
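The arithmetic behind these figures can be checked with a short calculation. The input numbers come from the note above; the function names are made up for this sketch, and the per-request time is only an average implied by the 9-hour total.

```python
import math


def harvest_requests(records, page_size):
    """Number of request/response pairs needed to page through a dataset."""
    return math.ceil(records / page_size)


def seconds_per_request(total_hours, requests):
    """Average time per request implied by the total harvest time."""
    return total_hours * 3600 / requests


if __name__ == "__main__":
    requests = harvest_requests(260_000, 200)
    print("requests:", requests)
    print("average seconds per request:", seconds_per_request(9, requests))
    # Size ratios between the XML transferred and the text and zip forms.
    xml_mb, text_mb, zip_mb = 500, 32, 3
    print("XML vs text:", xml_mb / text_mb)
    print("XML vs zip:", xml_mb / zip_mb)
```

At roughly 25 seconds per page, most of the harvest time goes into protocol overhead rather than the data itself, which is the case the note makes for a bulk export.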
Darwin Core Archives provide Darwin Core-based occurrence and taxonomic data in a simple, text-based format. They simplify the exchange of indexes by eliminating the use of federated transfer protocols. Data are accessed via a simple URL using HTTP.
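As a rough illustration of how little is needed to consume such an archive, the sketch below builds a tiny archive in memory and reads it back using only the Python standard library. The sample rows are invented, the `meta.xml` is a simplified descriptor, and the reader assumes a tab-delimited core file; a real consumer would first download the zip from the publisher's URL and honour every setting in the descriptor.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"

# Simplified descriptor: maps column positions in the core file to Darwin Core terms.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/country"/>
  </core>
</archive>
"""

# Invented sample rows, tab-delimited with one header line.
OCCURRENCE_TXT = (
    "id\tscientificName\tcountry\n"
    "1\tOenanthe oenanthe\tDenmark\n"
    "2\tPitta sordida\tThailand\n"
)


def build_sample_archive():
    """Zip the descriptor and the data file into an in-memory archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("meta.xml", META_XML)
        zf.writestr("occurrence.txt", OCCURRENCE_TXT)
    return buf.getvalue()


def read_core_rows(archive_bytes):
    """Return the core rows as dicts keyed by the short Darwin Core term name."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        core = ET.fromstring(zf.read("meta.xml")).find(DWC_TEXT_NS + "core")
        location = core.find(f"{DWC_TEXT_NS}files/{DWC_TEXT_NS}location").text
        columns = {int(core.find(DWC_TEXT_NS + "id").get("index")): "id"}
        for field in core.findall(DWC_TEXT_NS + "field"):
            columns[int(field.get("index"))] = field.get("term").rsplit("/", 1)[-1]
        skip = int(core.get("ignoreHeaderLines", "0"))
        text = zf.read(location).decode(core.get("encoding", "UTF-8"))
    # Assumption of this sketch: the core file is tab-delimited.
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))[skip:]
    return [{name: row[i] for i, name in columns.items()} for row in rows]


if __name__ == "__main__":
    for row in read_core_rows(build_sample_archive()):
        print(row)
```

The whole dataset arrives in one HTTP transfer, so the paging and restart problems of the wrapper protocols do not arise.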
Darwin Core Archives provide GBIF with the means to reduce what is currently a month or more between when a data publisher registers data and its subsequent appearance in the data portal. We anticipate that, with increased uptake of Darwin Core Archive and improvements in our data integration processes, we can reduce the latency from approximately a month to a week or less. In addition, Darwin Core Archive has enabled us to index very large datasets that simply could not be harvested using the federated protocols.
Thus, since the Darwin Core Archive standard has been adopted, GBIF has seen a significant increase in the numbers of data records published through the network with a 50% increase in 2011 alone.
A second significant issue that challenges effective delivery of biodiversity data in a federated network is the quality of the geospatial properties of records.
This map shows raw data as harvested from data providers that is asserted to originate in the United States. Note the mirror image of the United States over India and China. This is due to a missing negative symbol in the longitude data value.
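A check for this kind of error might look like the sketch below. It is not GBIF's interpretation code: the bounding box is a rough approximation of the contiguous United States chosen for illustration (it ignores Alaska, Hawaii and territories), and the function names and sample coordinates are hypothetical.

```python
# Rough bounding box for the contiguous United States (an approximation for this sketch).
US_BOX = {"lat": (24.0, 50.0), "lon": (-125.0, -66.0)}


def in_box(lat, lon, box=US_BOX):
    """True if the coordinate falls inside the bounding box."""
    return box["lat"][0] <= lat <= box["lat"][1] and box["lon"][0] <= lon <= box["lon"][1]


def check_us_record(lat, lon):
    """Classify a coordinate from a record that claims to be from the USA."""
    if in_box(lat, lon):
        return "ok"
    if in_box(lat, -lon):
        # The point lands in the mirror image: the minus sign was probably dropped.
        return "suspect: longitude sign"
    return "outside"


if __name__ == "__main__":
    print(check_us_record(41.5, -70.7))  # coordinate inside the box
    print(check_us_record(41.5, 70.7))   # same point with the minus sign missing
    print(check_us_record(13.7, 100.5))  # neither the point nor its mirror fits
```

Flagging the record rather than silently flipping the sign keeps the verbatim value available for the data publisher to correct.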
This is how the data look after improved interpretation methods have been applied. We can now recognise international waters and offshore islands.
Providing taxonomic access to biodiversity data is a key requirement for many users. Both DarwinCore and ABCD provide the means for data publishers to include the Linnaean classification of the referenced species within the data record. In a federated network, the result is that the same taxon may be classified in different ways. Not only does this complicate assembling a common taxonomic backbone for organising indexed data, it also complicates distinguishing actual homonyms: cases where the same name has been applied to two different taxa. In addition, scientific names are often misspelled, and even a correctly spelled name may exist in many different orthographic forms.
GBIF assembles a taxonomic backbone from taxonomic sources that are more authoritative than the classifications included with collections data. These sources are derived from new capacities within the GBIF network that enable species information to be published through the GBIF network in the same manner as collections (species occurrence) data. The GBIF taxonomic backbone, once assembled from a mix of both authoritative and collections-based classifications, is now composed entirely from published taxonomic catalogue data.
An example of how this impacts data organisation and delivery is illustrated in the map above. A European bird species with a name not occurring in the Catalogue of Life was mistakenly placed within the hummingbirds (a New World group) based on classification information tied to some of the specimens. This resulted in the map above, where one erroneous species grouping impacts the map for the entire family.
With access to a wider array of authoritative taxonomic sources, we are able to match more taxa using more reliable sources and improve the taxonomic backbone used to organise all species data records.
This improved taxonomic reconciliation extends to the resolution of homonyms – names for different taxa that are spelled alike. Relying solely on taxonomic information within occurrence data sources provides a confusing array of possible homonyms. Relying on taxonomic authority files reveals there are exactly two genera with this name and includes a common name to help distinguish them.
Lastly, informatics improvements complement the addition of authoritative taxonomic sources in providing better methods for matching names to authority files. GBIF's name parsing service parses names into recognised component parts and builds canonical representations of names that allow different forms of the same name to be matched to authority file information.
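The idea of a canonical name can be illustrated with a deliberately simplified sketch. This is not GBIF's name parser: it is a single regular expression that only handles a genus with an optional species epithet followed by free-text authorship, and the function name and sample strings are chosen for illustration.

```python
import re

# Genus (capitalised), optional species epithet (lowercase), then whatever remains
# is treated as authorship. Infraspecific ranks, hybrids and the like are not handled.
NAME_RE = re.compile(
    r"^\s*(?P<genus>[A-Z][a-z]+)"
    r"(?:\s+(?P<species>[a-z][a-z-]+)(?=\s|$))?"
    r"\s*(?P<authorship>.*?)\s*$"
)


def parse_name(raw):
    """Split a name string into parts and build its canonical form, or return None."""
    match = NAME_RE.match(raw)
    if match is None:
        return None
    parts = match.groupdict()
    canonical = " ".join(p for p in (parts["genus"], parts["species"]) if p)
    return {**parts, "canonical": canonical}


if __name__ == "__main__":
    for raw in (
        "Oenanthe oenanthe (Linnaeus, 1758)",
        "  Oenanthe   oenanthe L.",
        "Oenanthe Vieillot, 1816",
    ):
        print(repr(raw), "->", parse_name(raw))
```

Two strings that differ only in spacing or authorship reduce to the same canonical form, which is what makes them comparable against an authority file.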
