iME4d - BiOnym
A concept-mapping workflow for taxon names reconciliation
Friday 7 March 2014 – Rome
Fabio Fiorellato, Edward Vanden Berghe, Gianpaolo Coro, Nicolas Bailly
Big Data makes its way to biology
• Data volumes increase dramatically
– Management of large databases (millions of
records) is easier
• no longer the realm of professional IT people
– Biologists wake up to the advantages of
• Good data management, including preservation
• Data sharing
• Makes it possible to do science in a different
way
‘Big Data’: Need for data integration
• Becoming a very realistic possibility
– Management of DBs of millions of records
• Needs integration of small, restricted-scope
datasets into massive databases
– Intra-discipline integration (homogeneous)
– Inter-discipline integration (heterogeneous)
• Individual studies too small to inform on a scale
commensurate with problems humankind faces
– Evidence-based management of living resources
– Climate change, global warming…
iMarine biodiversity ‘ecosystem’
[Diagram labels: Taxon name access; Taxon name enrichment; Taxon name reconciliation; Occurrence data access; Occurrence data enrichment; Occurrence data reconciliation; Environmental data access; Distribution modelling; openModeller; AquaMaps]
Central role of taxon name reconciliation
[Diagram labels: Taxon name access; Taxon name enrichment; Taxon name reconciliation; Occurrence data access; Occurrence data enrichment; Occurrence data reconciliation; Environmental data access; Distribution modelling; openModeller; AquaMaps]
Taxonomic names are the keys…
• … Keys to bind together information on the
same taxon from different sources
• But there are problems
– Different research groups use different spellings
– Accidental misspellings
– Synonym, homonym reconciliation (but outside
scope of BiOnym)
Some people can’t type
• Asthenognathas inaefaipes
• Asthenognathus inaeqipes
• Asthenognathus maefaipes
• Astheognathus inaequipes
• Asthenognathus inaeguipes
• Astheognathus inaeqinipes
• Asthenognathus inaequipes
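How far each of these variants is from the intended name can be quantified with an edit distance. The sketch below is a plain Levenshtein implementation, not BiOnym's own matcher, and it assumes that the last entry in the list above, Asthenognathus inaequipes, is the accepted spelling.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


# Assumption: the last spelling on the slide is the accepted one.
ACCEPTED = "Asthenognathus inaequipes"
VARIANTS = [
    "Asthenognathas inaefaipes",
    "Asthenognathus inaeqipes",
    "Asthenognathus maefaipes",
    "Astheognathus inaequipes",
    "Asthenognathus inaeguipes",
    "Astheognathus inaeqinipes",
]

distances = {variant: levenshtein(variant, ACCEPTED) for variant in VARIANTS}
for variant, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{distance:2d}  {variant}")
```

Every variant is only a handful of edits away from the accepted name, which is why a distance threshold can recover them where an exact string comparison cannot.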
Things can go wrong with Excel…
• Clupea harengus Linnaeus, 1758
• Clupea harengus Linnaeus, 1759
• Clupea harengus Linnaeus, 1760
• Clupea harengus Linnaeus, 1761
• Clupea harengus Linnaeus, 1762
• …
… very wrong
• Clupea harengus Linnaeus, 1758
• Clupea harengus Linnaeus, 1759
• Clupea harengus Linnaeus, 1760
• …
• Clupea harengus Linnaeus, 2254
• Clupea harengus Linnaeus, 2255
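This kind of spreadsheet fill-down damage leaves a recognisable signature: the same name and author followed by a run of consecutive years. A minimal sketch of a check for it follows; the function name, the regular expression and the `min_run` threshold are illustrative assumptions, not part of BiOnym.

```python
import re
from collections import defaultdict

# Assumes the authorship year is the last four digits, after a comma.
YEAR_AT_END = re.compile(r"^(?P<name>.+?),\s*(?P<year>\d{4})$")


def suspicious_year_runs(names, min_run=3):
    """Return {name: (first_year, last_year)} for names with >= min_run consecutive years."""
    years_by_name = defaultdict(set)
    for raw in names:
        match = YEAR_AT_END.match(raw.strip())
        if match:
            years_by_name[match.group("name")].add(int(match.group("year")))

    flagged = {}
    for name, years in years_by_name.items():
        ordered = sorted(years)
        run = best = 1
        for previous, current in zip(ordered, ordered[1:]):
            run = run + 1 if current == previous + 1 else 1
            best = max(best, run)
        if best >= min_run:
            flagged[name] = (ordered[0], ordered[-1])
    return flagged


sample = [
    "Clupea harengus Linnaeus, 1758",
    "Clupea harengus Linnaeus, 1759",
    "Clupea harengus Linnaeus, 1760",
    "Clupea harengus Linnaeus, 1761",
    "Clupea harengus Linnaeus, 1762",
    "Gadus morhua Linnaeus, 1758",
]
flagged = suspicious_year_runs(sample)
print(flagged)
```

A flagged name still needs a human or a reference list to decide which year is the right one; the check only points at where the damage probably is.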
Taxonomic names are the keys…
• … Keys to bind together information on the
same taxon from different sources
• But there are problems
– Different research groups use different spellings
– Accidental misspellings
• Reconciliation is a necessity, not a luxury!
Existing systems…
• … Are not flexible
– We need flexibility, as our use case will dictate what the ‘optimal’
behaviour of the system is
• E.g. manual vs automatic systems
• … Are often coupled to a single ‘reference list’
– Using a different taxonomic scope for the test and reference lists only increases
false positives
• E.g. TaxaMatch with IRMNG…
• … Don’t always have the throughput needed for
large-scale projects
– Largest database has approx. 20M names – too many pairs!
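A back-of-the-envelope calculation shows why the number of pairs becomes the bottleneck. Only the 20M reference size comes from the slide; the input size and the comparison rate below are hypothetical figures chosen for illustration.

```python
def brute_force_cost(input_names, reference_names, comparisons_per_second):
    """Return (pairs, days) for comparing every input name with every reference name."""
    pairs = input_names * reference_names
    seconds = pairs / comparisons_per_second
    return pairs, seconds / 86_400  # 86,400 seconds per day


REFERENCE_NAMES = 20_000_000        # approximate figure quoted on the slide
INPUT_NAMES = 100_000               # hypothetical input list
COMPARISONS_PER_SECOND = 1_000_000  # hypothetical single-worker rate

pairs, days = brute_force_cost(INPUT_NAMES, REFERENCE_NAMES, COMPARISONS_PER_SECOND)
print(f"{pairs:,} pairs, about {days:.1f} days on one worker")
```

Under these assumptions a single worker would need weeks, which is the motivation for spreading the comparisons over many nodes.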
Our need
• A flexible, highly customisable, workflow-
based approach to taxon name matching
– User controls input
– Output can be used as input in other
processes
– Running on high performance computing
infrastructure
BiOnym!
Introduction to BiOnym
• As a workflow for taxon name mapping and reconciliation, it is
a real-world application of the concept-mapping principles
• It is focused on the domain of taxonomy, with an initial
restriction to marine species only
• Provides a full workflow (not only the concept mapping part)
• Tries to address - and possibly solve - many issues common to
the taxonomic community
• Its key concept is “species taxonomy”, where concept
properties are the taxonomic atoms
• Is open to integration from third party components
• Takes advantage of the iMarine distributed infrastructure
The iMarine solution: existing state-of-the-art
• A general purpose concept mapping framework
(COMET) was already available in FAO:
– based on an existing FAO product (limited to the fishing
vessels domain) initially developed with the support of the
Japanese trust fund
– domain independent (can be tailored to any custom
domain with little effort)
– provided with all the necessary building blocks and
components for general purpose usage
The iMarine solution: the quest for integration
• The integration of COMET inside iMarine was hailed
and expected.
• Its main challenges:
– Identify and define the custom domain (biological taxonomy)
– Design and implement:
• custom COMET matchlets (engine assigning similarity scores to pairs of names)
• additional, reusable tools for data interchange and data preparation
(DwCA converter, input parser, pre- and post-processors)
– Enable components to be easily distributed among worker nodes
inside the infrastructure
– Integration in the iMarine Statistical Manager
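The idea of a matchlet, an engine that assigns a similarity score to a pair of names, can be sketched as a plain function returning a value between 0 and 1. Everything below is a hypothetical illustration in Python: the names `Matchlet`, `exact_matchlet`, `genus_matchlet`, `fuzzy_matchlet` and `combined_score` are assumptions and do not reproduce the COMET API.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple

# Hypothetical definition: a matchlet scores a pair of names in [0, 1].
Matchlet = Callable[[str, str], float]


def exact_matchlet(a: str, b: str) -> float:
    """1.0 only when the two strings are identical."""
    return 1.0 if a == b else 0.0


def genus_matchlet(a: str, b: str) -> float:
    """1.0 when the first word (the genus) agrees, ignoring case."""
    return 1.0 if a.lower().split()[:1] == b.lower().split()[:1] else 0.0


def fuzzy_matchlet(a: str, b: str) -> float:
    """Similarity ratio from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_score(a: str, b: str, weighted: Sequence[Tuple[Matchlet, float]]) -> float:
    """Weighted average of the scores given by several matchlets."""
    total_weight = sum(weight for _, weight in weighted)
    return sum(matchlet(a, b) * weight for matchlet, weight in weighted) / total_weight


WEIGHTED = [(genus_matchlet, 1.0), (fuzzy_matchlet, 2.0)]
score = combined_score("Clupea harengas", "Clupea harengus", WEIGHTED)
print(f"{score:.3f}")
```

Because each matchlet is independent of the others, a batch of name pairs can be scored on separate worker nodes and the results merged afterwards.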
The iMarine solution: a success story
• The COMET integration inside iMarine, as part of the
BiOnym workflow, is an example of a success story:
– Solving the integration challenges required limited effort
• Harvest names for input through iMarine tools
• Send output from BiOnym/COMET on to further tools
– The core matching capabilities of BiOnym were first made
available in June 2013
• Pre- and post-processing; parsing
• Matching through (a series of) matchlets, assigning a similarity
score to pairs of names
– The modular architecture enabled developers to add new
functionalities or improve existing ones with ease
BiOnym key concepts and features
• Its modular architecture is open to contribution and
alternatives
– Custom implementations can be plugged in for each workflow stage
– Can leverage third party components (e.g. the input data parsing is available
either as an in-house component or as a wrapper of the GNI parser from
globalnames.org)
• Based on standard and open formats
– Reference data are synthesized from DwCA files
– Input data and matching results are expected and produced as CSV files
– Matching results can also be emitted as XML files in the COMET format
• High flexibility
– Multiple chained matchers, each with its own configuration and thresholds
– Third party matchers (e.g. Tony Rees’ TaxaMatch) can be seamlessly ‘wrapped’
and plugged into the workflow
– Support for collaborative matching results evaluation (expected soon)
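A chain of matchers, each with its own threshold, with results written as CSV, can be sketched as follows. This is an illustration under stated assumptions rather than BiOnym's implementation: the three matchers, the thresholds, the `CHAIN` structure and the CSV column names are all hypothetical.

```python
import csv
import io
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def exact(a, b):
    return 1.0 if a == b else 0.0


def normalised(a, b):
    return 1.0 if normalise(a) == normalise(b) else 0.0


def fuzzy(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


# Hypothetical configuration: (label, matcher, minimum score), tried in order.
CHAIN = [("exact", exact, 1.0), ("normalised", normalised, 1.0), ("fuzzy", fuzzy, 0.85)]


def match(name, reference, chain=CHAIN):
    """Return (matcher label, best reference name, score) from the first matcher that passes."""
    if not reference:
        return "unmatched", "", 0.0
    for label, matcher, threshold in chain:
        score, best = max((matcher(name, ref), ref) for ref in reference)
        if score >= threshold:
            return label, best, score
    return "unmatched", "", 0.0


def results_as_csv(names, reference):
    """Match every input name and return the results as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["input_name", "matched_name", "matcher", "score"])
    for name in names:
        label, best, score = match(name, reference)
        writer.writerow([name, best, label, f"{score:.2f}"])
    return buffer.getvalue()


REFERENCE = ["Clupea harengus", "Gadus morhua", "Thunnus albacares"]
INPUT = ["Clupea harengus", "clupea  HARENGUS", "Clupea harengas", "Zzz"]
print(results_as_csv(INPUT, REFERENCE))
```

Putting the cheap, strict matchers first means the expensive fuzzy comparison only runs for names that nothing else could resolve, and the `matcher` column records which stage produced each result.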
BiOnym System: Overview
BiOnym Workflow
Where are we?
• Infrastructure has largely been built
• User-friendly GUI is under development
• Evaluation
– Efficiency: speed of computations
• Parallel system, compares well with others
– Effectiveness: are the results OK?
• Ran experiments on different test datasets
– Deliberately introducing misspellings in known lists
– ‘Real’ misspellings manually corrected for other purposes
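The first kind of experiment, deliberately introducing misspellings into a known list, can be sketched as below. The error model (one random single-character edit per name), the stand-in matcher (`difflib.get_close_matches`) and the cutoff are assumptions for illustration; they are not the procedure or the matcher used in the actual evaluation.

```python
import difflib
import random
import string


def misspell(name, rng):
    """Apply one random single-character edit: delete, substitute or insert."""
    position = rng.randrange(len(name))
    operation = rng.choice(["delete", "substitute", "insert"])
    if operation == "delete":
        return name[:position] + name[position + 1:]
    if operation == "substitute":
        alternatives = [c for c in string.ascii_lowercase if c != name[position]]
        return name[:position] + rng.choice(alternatives) + name[position + 1:]
    return name[:position] + rng.choice(string.ascii_lowercase) + name[position:]


def recall_on_synthetic_errors(reference, seed=0, cutoff=0.8):
    """Fraction of misspelt names that are mapped back to the name they came from."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    hits = 0
    for true_name in reference:
        noisy = misspell(true_name, rng)
        best = difflib.get_close_matches(noisy, reference, n=1, cutoff=cutoff)
        if best and best[0] == true_name:
            hits += 1
    return hits / len(reference)


KNOWN_LIST = [
    "Asthenognathus inaequipes",
    "Clupea harengus",
    "Gadus morhua",
    "Thunnus albacares",
]
recall = recall_on_synthetic_errors(KNOWN_LIST)
print(f"recall on synthetic misspellings: {recall:.2f}")
```

Because the true name behind every synthetic error is known, effectiveness can be scored automatically, whereas the ‘real’ misspellings rely on the earlier manual corrections as ground truth.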
The BiOnym Interface
Never mind the small print.
Step 1: Select your data
Step 2: Compose the
matching process. This
relies on infrastructure
resources
Step 3: Review results. This
can be private and ‘for your
eyes only’, or public.
The BiOnym Workflow
Visualising
quality assessment
of the results of BiOnym
Where to from here?
• Validation
– Not in terms of quality of output but…
– Uptake by the biodiversity community
• Sustainability
– Who will take over maintenance after iMarine
ends?
• BiOnym is a tool, it is the means to an end
– Support Ecosystem Approach to Fisheries
iMarine biodiversity ‘ecosystem’
[Diagram labels: Taxon name access; Taxon name enrichment; Taxon name reconciliation; Occurrence data access; Occurrence data enrichment; Occurrence data reconciliation; Environmental data access; Distribution modelling; openModeller; AquaMaps]
BiOnym in its environment
Ecological modelling – Rich data management
Taxa Authority File
Vernacular Names Authority File
Darwin Core Archive
Based on the COMET Framework
developed by Fabio Fiorellato (FAO)
Biodiversity Maps Generation
Retrieve via any GeoNetwork
Ecological modelling - Processing

 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

BiOnym

  • 1. iME4d - BiOnym: A concept-mapping workflow for taxon names reconciliation. Friday 7 March 2014 – Rome. Fabio Fiorellato, Edward Vanden Berghe, Gianpaolo Coro, Nicolas Bailly
  • 2. Big Data makes its way to biology • Data volumes increase dramatically – Management of large databases (millions of records) easier • no longer the realm of professional IT people – Biologists wake up to the advantages of • Good data management, including preservation • Data sharing • Makes it possible to do science in a different way
  • 3. ‘Big Data’: Need for data integration • Becoming a very realistic possibility – Management of DBs of millions of records • Needs integration of small, restricted-scope datasets into massive databases – Intra-discipline integration (homogeneous) – Inter-discipline integration (heterogeneous) • Individual studies too small to inform on a scale commensurate with problems humankind faces – Evidence-based management of living resources – Climate change, global warming…
  • 4. iMarine biodiversity ‘ecosystem’ • Taxon name enrichment • Taxon name reconciliation • Taxon name access • Occurrence data access • Environmental data access • openModeller • AquaMaps • Distribution modelling • Occurrence data enrichment • Occurrence data reconciliation
  • 5. Central role of taxon name reconciliation • Taxon name enrichment • Taxon name reconciliation • Taxon name access • Occurrence data access • Environmental data access • openModeller • AquaMaps • Distribution modelling • Occurrence data enrichment • Occurrence data reconciliation
  • 6. Taxonomic names are the keys… • … Keys to bind together information on the same taxon from different sources • But there are problems – Different research groups use different spellings – Accidental misspellings – Synonym, homonym reconciliation (but outside scope of BiOnym)
  • 7. Some people can’t type • Asthenognathas inaefaipes • Asthenognathus inaeqipes • Asthenognathus maefaipes • Astheognathus inaequipes • Asthenognathus inaeguipes • Astheognathus inaeqinipes • Asthenognathus inaequipes
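The variants on slide 7 can be scored against the correct spelling (the last name in the list) with an edit distance. The following is a minimal sketch that assumes plain Levenshtein distance normalised by string length; the slides do not say which measure BiOnym uses, so the function names and the scoring formula here are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


reference = "Asthenognathus inaequipes"
variants = [
    "Asthenognathas inaefaipes",
    "Asthenognathus inaeqipes",
    "Asthenognathus maefaipes",
    "Astheognathus inaequipes",
    "Asthenognathus inaeguipes",
    "Astheognathus inaeqinipes",
]
for name in variants:
    print(f"{name:28s} distance={levenshtein(name, reference)} "
          f"score={similarity(name, reference):.2f}")
```

A score close to 1.0 suggests a likely misspelling of the reference name; where to place the acceptance threshold depends on the use case.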
  • 8. Things can go wrong with Excel… • Clupea harengus Linnaeus, 1758 • Clupea harengus Linnaeus, 1759 • Clupea harengus Linnaeus, 1760 • Clupea harengus Linnaeus, 1761 • Clupea harengus Linnaeus, 1762 • …
  • 9. … very wrong • Clupea harengus Linnaeus, 1758 • Clupea harengus Linnaeus, 1759 • Clupea harengus Linnaeus, 1760 • … • Clupea harengus Linnaeus, 2254 • Clupea harengus Linnaeus, 2255
  • 10. Taxonomic names are the keys… • … Keys to bind together information on the same taxon from different sources • But there are problems – Different research groups use different spellings – Accidental misspellings • Reconciliation is a necessity, not a luxury!
  • 11. Existing systems… • … Are not flexible – We need flexibility, as our use case will dictate what the ‘optimal’ behaviour of the system is • E.g. manual vs automatic systems • … Are often coupled to a single ‘reference list’ – Using different taxonomic scopes for test and reference lists only increases false positives • E.g. TaxaMatch with IRMNG… • … Don’t always have the throughput needed for large-scale projects – Largest database approx. 20M names – too many pairs!
  • 12. Our need • A flexible, highly customisable, workflow-based approach to taxon name matching – User controls input – Output can be used as input in other processes – Running on high performance computing infrastructure • BiOnym!
  • 13. Introduction to BiOnym • As a workflow for taxon name mapping and reconciliation, it is a real-world application of the concept-mapping principles • It is focused on the domain of taxonomy, with an initial restriction to marine species only • Provides a full workflow (not only the concept mapping part) • Tries to address - and possibly solve - many issues common to the taxonomic community • Its key concept is “species taxonomy”, where concept properties are the taxonomic atoms • Is open to integration from third party components • Takes advantage of the iMarine distributed infrastructure
  • 14. The iMarine solution: existing state-of-the-art • A general purpose concept mapping framework (COMET) was already available in FAO: – based on an existing FAO product (limited to the fishing vessels domain) initially developed with the support of the Japanese trust fund – domain independent (can be tailored to any custom domain with little effort) – provided with all the necessary building blocks and components for general purpose usage
  • 15. The iMarine solution: the quest for integration • The integration of COMET inside iMarine was hailed and expected. • Its main challenges: – Identify and define the custom domain (biological taxonomy) – Design and implement: • custom COMET matchlets (engine assigning similarity scores to pairs of names) • additional, reusable tools for data interchange and data preparation (DwCA converter, input parser, pre- and post-processors) – Enable components to be easily distributed among worker nodes inside the infrastructure – Integration in the iMarine Statistical Manager
  • 16. The iMarine solution: a success story • The COMET integration inside iMarine, as part of the BiOnym workflow, is an example of a success story: – Solving the integration challenges required limited effort • Harvest names for input through iMarine tools • Send output from BiOnym/COMET on to further tools – The core matching capabilities of BiOnym were first made available in June 2013 • Pre- and post-processing; parsing • Matching through (a series of) matchlets, assigning a similarity score to pairs of names – The modular architecture enabled developers to add new functionalities or improve existing ones with ease
  • 17. BiOnym key concepts and features • Its modular architecture is open to contribution and alternatives – Workflow stages can be plugged in with custom business implementations – Can leverage third party components (e.g. the input data parsing is available both as an in-house component or as a wrapper of the GNI parser from globalnames.org) • Based on standard and open formats – Reference data are synthesized from DwCA files – Input data and matching results are expected and produced as CSV files – Matching results can also be emitted as XML files in the COMET format • High flexibility – Multiple chained matchers, each with its own configuration and thresholds – Third party matchers (e.g. Tony Rees’ TaxaMatch) can be seamlessly ‘wrapped’ and plugged in the workflow – Support for collaborative matching results evaluation (expected soon)
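The idea of "multiple chained matchers, each with its own configuration and thresholds" on slide 17 can be illustrated roughly as follows. This is a sketch under stated assumptions, not the COMET or BiOnym API: the `Matcher` class, the `run_chain` function and the rule that a name accepted by one stage skips the later stages are all hypothetical, and the fuzzy stage simply uses `difflib.SequenceMatcher` from the Python standard library.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]


@dataclass
class Matcher:
    name: str         # hypothetical label for the stage
    score: Scorer     # returns a similarity in [0, 1] for a pair of names
    threshold: float  # minimum score accepted at this stage


def exact(a: str, b: str) -> float:
    """Case-insensitive exact comparison."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def fuzzy(a: str, b: str) -> float:
    """Approximate comparison based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def run_chain(inputs: List[str], reference: List[str],
              chain: List[Matcher]) -> Tuple[Dict[str, Tuple[str, float, str]], List[str]]:
    """Try each stage in order; names accepted by a stage skip later stages."""
    results: Dict[str, Tuple[str, float, str]] = {}
    pending = list(inputs)
    for matcher in chain:
        still_pending = []
        for name in pending:
            best = max(reference, key=lambda ref: matcher.score(name, ref))
            best_score = matcher.score(name, best)
            if best_score >= matcher.threshold:
                results[name] = (best, best_score, matcher.name)
            else:
                still_pending.append(name)
        pending = still_pending
    return results, pending


reference = ["Clupea harengus", "Gadus morhua"]
inputs = ["clupea harengus", "Gadus morhau", "Thunnus albacares"]
chain = [Matcher("exact", exact, 1.0), Matcher("fuzzy", fuzzy, 0.9)]
matched, unmatched = run_chain(inputs, reference, chain)
for name, (ref, score, stage) in matched.items():
    print(f"{name} -> {ref} ({stage}, {score:.2f})")
print("unmatched:", unmatched)
```

Putting cheap, strict stages first keeps the expensive fuzzy comparison for the names that actually need it, which matters when the reference list holds millions of names.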
  • 20. Where are we? • Infrastructure has largely been built • User-friendly GUI is under development • Evaluation – Efficiency: speed of computations • Parallel system, compares well with others – Effectiveness: are the results OK? • Ran experiments on different test datasets – Deliberately introducing misspellings in known lists – ‘Real’ misspellings manually corrected for other purposes
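The effectiveness test described on slide 20 (deliberately introducing misspellings in known lists) can be sketched as below. The error model (one random deletion, substitution or swap per name), the `misspell` and `recall` helpers, and the use of `difflib.get_close_matches` as the matcher under test are assumptions made for illustration; the slides do not describe how the actual experiments were set up.

```python
import random
from difflib import get_close_matches
from typing import List


def misspell(name: str, rng: random.Random) -> str:
    """Apply one random character-level edit; needs at least two characters.

    The edit may occasionally be a no-op (e.g. swapping two equal letters).
    """
    position = rng.randrange(len(name) - 1)
    edit = rng.choice(["delete", "substitute", "swap"])
    if edit == "delete":
        return name[:position] + name[position + 1:]
    if edit == "substitute":
        letter = rng.choice("abcdefghijklmnopqrstuvwxyz")
        return name[:position] + letter + name[position + 1:]
    # swap the character with its right-hand neighbour
    return (name[:position] + name[position + 1] + name[position]
            + name[position + 2:])


def recall(known: List[str], rng: random.Random, cutoff: float = 0.8) -> float:
    """Fraction of corrupted names whose best match is the original name."""
    hits = 0
    for truth in known:
        corrupted = misspell(truth, rng)
        best = get_close_matches(corrupted, known, n=1, cutoff=cutoff)
        if best and best[0] == truth:
            hits += 1
    return hits / len(known)


known = ["Clupea harengus", "Gadus morhua", "Thunnus albacares",
         "Asthenognathus inaequipes"]
score = recall(known, random.Random(42))  # fixed seed for repeatable runs
print(f"recall on synthetic misspellings: {score:.2f}")
```

Because the correct name is known for every corrupted input, this kind of test measures how often the matcher recovers it without any manual checking.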
  • 21. The BiOnym Interface • Never mind the small print. • Step 1: Select your data • Step 2: Compose the matching process. This relies on infrastructure resources • Step 3: Review results. This can be private and ‘for your eyes only’, or public.
  • 24. Where to from here? • Validation – Not in terms of quality of output but… – Uptake by the biodiversity community • Sustainability – Who will take over maintenance after iMarine ends? • BiOnym is a tool, it is the means to an end – Support Ecosystem Approach to Fisheries
  • 25. iMarine biodiversity ‘ecosystem’ • Taxon name enrichment • Taxon name reconciliation • Taxon name access • Occurrence data access • Environmental data access • openModeller • AquaMaps • Distribution modelling • Occurrence data enrichment • Occurrence data reconciliation
  • 26. BiOnym in its environment • Ecological modelling – Rich data management • Taxa Authority File • Vernacular Names Authority File • Darwin Core Archive • Based on the COMET Framework developed by Fabio Fiorellato (FAO)
  • 27. Biodiversity Maps Generation Retrieve via any GeoNetwork Ecological modelling - Processing