This document discusses using a rules-based data linking tool to connect disparate biodiversity data sources. It proposes applying the tool to link (1) plant names in floras to the International Plant Names Index, (2) cited type specimens in IPNI to actual specimen records, (3) flora accounts to herbarium specimens, and (4) duplicate specimen records between herbaria collections. The tool transforms and matches fields in tabular datasets using customizable rules to identify relationships between entities from different sources.
Biothings APIs: high-performance bioentity-centric web services - Chunlei Wu
High-performance web service APIs for gene and genetic variant annotations: MyGene.info and MyVariant.info, and an SDK for building the same kind of high-performance API for other biomedical data types ("biothings").
Presented this slide deck to analytic and evaluation professionals at the Ohio Program Evaluators' Group's bi-annual conference. Discussed how to reduce large, complex datasets into smaller, manageable projects.
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
ChemSpider was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository, and making it available to everybody at no charge. There are many tens of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data, etc., and no single way to search across them. Despite the diversity of databases available online, their inherent quality, accuracy and completeness are lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online, and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of well over 20 million chemical substances integrated with over 300 disparate data sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for the semantic web for chemistry, and to provide access to a set of online tools and services supporting access to these data. I will also discuss how ChemSpider is being used to enhance Semantic Publishing in Chemistry at RSC.
A brief presentation on current Big Data challenges in bioinformatics. A case study using one of the computational methods for drug discovery is presented. The cost of developing a new drug is increasing dramatically every year, along with the challenges associated with it. The big data approach is penetrating drug discovery slowly but steadily. We believe effective use of big data would be highly beneficial for taking several crucial decisions during the complete drug discovery process. Data management using Hadoop and analysis using the R programming package are also discussed.
DataTags, The Tags Toolset, and Dataverse Integration - Michael Bar-Sinai
This presentation describes the concept of DataTags, which simplifies handling of sensitive datasets. It then shows the Tags toolset, and how it is integrated with Dataverse, Harvard's popular dataset repository.
Reusable Software and Open Data To Optimize Agriculture - David LeBauer
Abstract:
Humans need a secure and sustainable food supply, and science can help. We have an opportunity to transform agriculture by combining knowledge of organisms and ecosystems to engineer ecosystems that sustainably produce food, fuel, and other services. The challenge is that the information we have is difficult to combine: measurements, theories, and laws are scattered across publications, notebooks, software, and human brains. We homogenize, encode, and automate the synthesis of data and mechanistic understanding in a way that links understanding at different scales and across domains. This allows extrapolation, prediction, and assessment. Reusable components allow automated construction of new knowledge that can be used to assess, predict, and optimize agro-ecosystems.
Developing reusable software and open-access databases is hard; examples will illustrate how we use the Predictive Ecosystem Analyzer (PEcAn, pecanproject.org), the Biofuel Ecophysiological Traits and Yields database (BETYdb, betydb.org), and ecophysiological crop models to predict crop yield, decide which crops to plant, and identify which traits can be selected for the next generation of data-driven crop improvement. A next step is to automate the use of sensors mounted on robots, drones, and tractors to assess plants in the field. The TERRA Reference Phenotyping Platform (TERRA-Ref, terraref.github.io) will provide an open-access database and computing platform on which researchers can use and develop tools that use sensor data to assess and manage agricultural and other terrestrial ecosystems.
TERRA-Ref will adopt existing standards and develop modular software components and common interfaces, in collaboration with researchers from iPlant, NEON, AgMIP, USDA, rOpenSci, ARPA-E, many scientists and industry partners. Our goal is to advance science by enabling efficient use, reuse, exchange, and creation of knowledge.
---
Invited talk for the "Informatics for Reproducibility in Earth and Environmental Science Research" session at the American Geophysical Union Fall Meeting, Dec 17 2015.
Building a Network of Interoperable and Independently Produced Linked and Ope... - Michel Dumontier
Over 15 years ago, Sir Tim Berners-Lee proclaimed the founding of an exciting new future involving intelligent agents operating over smarter data in order to perform complex tasks at the behest of their human controllers. At the heart of this vision lies an uneasy alliance between tedious formal knowledge representations and powerful analytics over big, but often messy data. Bio2RDF, our decade-old open source project to create Linked Data for the life sciences, has woven emergent Semantic Web technologies such as ontologies and Linked Data to generate FAIR - Findable, Accessible, Interoperable, and Reusable - data in the form of billions of machine-accessible statements for use in downstream biomedical discovery.
This revolution in data publication has been strengthened by action from global bioinformatics institutions such as the NCBI, NCBO, EBI, and DBCLS. Notably, NCBI's PubChem has successfully coupled large-scale data integration with community-based standards to offer a remarkable biochemical knowledge resource amenable to data-hungry discovery tools. Yet, in the face of increasing pressure from researchers, funders, and publishers, will these approaches be sufficient for growing and maintaining a comprehensive knowledge graph that is inclusive of all biomedical research?
A presentation to the New Year's Event for Maastricht University's Knowledge Engineering @ Work Program. https://www.maastrichtuniversity.nl/news/kework-first-10-students-academic-workstudy-track-graduate
Challenges in developing names services - RDA - nickyn
An explanation of how names data are gathered, structured, standardised and annotated, and how these data are mobilised using names services. Challenges centre on credit and attribution, and on usage metrics for services.
Presented at the Research Data Alliance plenary 5, 9-11 March 2015, San Diego.
Kaiso: Modeling Complex Class Hierarchies with Neo4j - David Szotten @ GraphC... - Neo4j
In this talk David will summarize business and technical use cases and introduce Kaiso. He will give a basic overview of how to use it, along with some examples of how one might use it to model complex class hierarchies. This will include some interactive code demonstrations. David will explore the main design goals of the project, the current state of the project, and take a look at what’s ahead on Kaiso’s roadmap.
IBC FAIR Data Prototype Implementation slideshow - Mark Wilkinson
Discussion about ways of achieving FAIRness of both metadata and data. Brute force approaches, and more elegant "projection" approaches are shown.
Relevant papers are at:
doi: 10.7717/peerj-cs.110 (https://peerj.com/articles/cs-110/)
doi: 10.3389/fpls.2016.00641 (https://doi.org/10.3389/fpls.2016.00641)
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
BioThings SDK: a toolkit for building high-performance data APIs in biology - Chunlei Wu
This is from my talk at BOSC 2017.
What’s BioThings?
We use “BioThings” to refer to objects of any biomedical entity-type represented in the biological knowledge space, such as genes, genetic variants, drugs, chemicals, diseases, etc.
BioThings SDK
SDK stands for "Software Development Kit". BioThings SDK provides a Python-based toolkit to build high-performance data APIs (or web services) from a single data source or multiple data sources. It has a particular focus on building data APIs for biomedical entities, a.k.a. "BioThings", though it is not necessarily limited to the biomedical scope. For any given "BioThings" type, BioThings SDK helps developers aggregate annotations from multiple data sources and expose them as a clean, high-performance web API.
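To make the "data API" idea concrete, here is a minimal sketch of querying a BioThings-style API such as MyGene.info. The URL pattern follows the public MyGene.info v3 query endpoint; the response dict below is an illustrative shape, not live data, and the helper names are our own, not part of the SDK.

```python
# Sketch: build a MyGene.info-style query URL and pick the best hit from
# a BioThings query response. `sample` is an illustrative response shape.
from urllib.parse import urlencode

MYGENE_QUERY = "https://mygene.info/v3/query"

def build_query_url(q, species="human", fields="symbol,name,entrezgene"):
    """Build a MyGene.info-style query URL for a gene symbol or keyword."""
    return MYGENE_QUERY + "?" + urlencode(
        {"q": q, "species": species, "fields": fields})

def top_hit(response):
    """Return the best-scoring hit from a BioThings query response, if any."""
    hits = response.get("hits", [])
    return hits[0] if hits else None

# Illustrative response shape (what a real call would return as JSON):
sample = {"total": 1, "hits": [{"_id": "1017", "symbol": "CDK2",
                                "name": "cyclin dependent kinase 2"}]}
print(build_query_url("symbol:CDK2"))
print(top_hit(sample)["symbol"])  # CDK2
```

The same request/parse pattern applies to any BioThings API, since they share the query/annotation endpoint design.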
Bioinformatics databases: Current Trends and Future Perspectives - University of Malaya
Data is the most powerful resource in any field of study. In biology, data comes from scientists and their work, and any institution that can make sense of the data it collects will be at the forefront of its research field. At the start of any data collection endeavour, it is critical to find proper management techniques to store data and to maximise its utilisation. This presentation reflects on current trends and techniques in data modeling and architecture, with a highlight on the uses of databases, focusing on bioinformatics examples and case studies. Finally, the future of bioinformatics databases is discussed to give an overview of the modeling techniques needed to accommodate the escalating growth of biological data in the coming years.
Scott Edmunds talk on GigaScience Big-Data, Data Citation and future data handling at the International Conference of Genomics on the 15th November 2011.
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati... - Mark Wilkinson
This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1
It describes the output of the "Skunkworks" FAIR implementation group, who were tasked with building a prototype infrastructure that would fulfill the FAIR Principles for scholarly data publishing. We show how a novel combination of the Linked Data Platform, RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.
This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
SciDataCon - How to increase accessibility and reuse for clinical and persona... - Fiona Nielsen
Presented in session 48 - Sharing of sensitive data - presented by Fiona Nielsen on September 12, 2016 at #SciDataCon http://scidatacon.org
We have addressed the most pressing problem for public genomic data, that of data discoverability, by indexing worldwide resources for genomic research data on an online platform (repositive.io) providing a single point of entry to find and access available genomic research data.
http://www.scidatacon.org/2016/sessions/48/paper/26/
http://www.scidatacon.org/2016/sessions/48/
International data week - #RDAPlenary #IDW2016
Advancing the International Plant Names Index (IPNI) - nickyn
The "names and taxa" information space is often thought of as being composed of three layers:
Taxonomic concepts
Code governed nomenclatural acts
Name occurrences
In many circumstances the distinction between these layers is blurred, leading to confusion and inefficiencies in information management. To date, IPNI has mainly been concerned with the middle layer, comprising ICBN-governed nomenclatural acts, and is formed of three key components: curated data, information services to expose these data, and dedicated editorial staff providing nomenclatural expertise.
IPNI will be advanced from its current state to better connect to the layers above (taxonomic concepts) and below (name occurrences). This will require the expansion of data holdings, improved linkages, and the development of information services and associated workflows. These will be offered to key actors including name authors, publishers, taxonomists and managers of biodiversity information.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference, 30 May 2024. We discuss what testing is, what agile testing is, and finally what Testing in DevOps is. We also held a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need in order to apply AI to our own infrastructure and make it work from an enterprise perspective. I will give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insight into the approaches I already have working for real.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open source: exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generating a custom Ruby SDK for your web service or Rails API using Smithy - g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation Sample
Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
2. A map + data + tools = links
Two minute background: what we’ve done, why we should link up our data
What is needed?
- Persistent identifiers
- Tools – to turn “strings” into “things”
What we’ve brought along:
- Map
- Data
- ... Labelled with persistent identifiers
- A rules-based matching / linking tool
20. Cited in:
Rakotoarinivo M, Dransfield J. 2010. New species of Dypsis and Ravenea (Arecaceae) from Madagascar. Kew Bull. 65, 279–303. doi:10.1007/s12225-010-9210-7
specimens.kew.org/herbarium/K000525802
21. Data linking tool
Rules-based
Armed with a tabular dataset, you:
Define zero or more transformers for each field
Define how fields must match
This is a match configuration.
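The transformer-plus-rules model above can be sketched in a few lines of Python. Everything here is illustrative (the function and field names are not the tool's actual API): each field gets zero or more transformers, and two records match when every configured field is equal after transformation.

```python
# Sketch of a match configuration: field -> list of transformers applied
# in order; two records match when all configured fields agree afterwards.
# All names and sample records are illustrative.

def strip_author(name):
    """Keep genus + epithet, dropping a trailing author string,
    e.g. 'Dypsis lutea Jum.' -> 'Dypsis lutea'."""
    return " ".join(name.split()[:2])

def lowercase(value):
    return value.lower()

# The match configuration itself.
config = {
    "taxon_name": [strip_author, lowercase],
    "collector_number": [str.strip],
}

def _apply(transformers, value):
    for t in transformers:
        value = t(value)
    return value

def transform(record, config):
    return {f: _apply(ts, record[f]) for f, ts in config.items()}

def matches(a, b, config):
    """True when every configured field agrees after transformation."""
    return transform(a, config) == transform(b, config)

flora = {"taxon_name": "Dypsis lutea Jum.", "collector_number": " 1234 "}
ipni  = {"taxon_name": "Dypsis lutea", "collector_number": "1234"}
print(matches(flora, ipni, config))  # True
```

The point of separating transformers from match rules is that the same configuration can then be reused against any tabular dataset with the same fields.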
24. Using the matcher
A configured match can run against any tabular dataset.
Accessible as:
- JSON web service
- Google Refine reconciliation service (work in progress)
Transformers can be dropped into Google Refine
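For the Google Refine route, the exchange follows the Refine reconciliation API: the client posts a batch of queries as JSON and gets back scored candidates per query. The sketch below shows those request/response shapes; the dictionary-based candidate lookup and the IPNI-style identifier are stand-ins for the real matcher backend.

```python
# Sketch of a Refine-style reconciliation exchange: a batch of queries in,
# scored candidates out. The lookup table stands in for the real matcher.
import json

def reconcile(queries_json, lookup):
    """Answer a batch of reconciliation queries against a name->id lookup."""
    queries = json.loads(queries_json)
    response = {}
    for key, q in queries.items():
        hit = lookup.get(q["query"])
        response[key] = {"result": ([{"id": hit, "name": q["query"],
                                      "score": 100, "match": True}]
                                    if hit else [])}
    return json.dumps(response)

# Illustrative backend: name strings -> IPNI-style LSIDs.
ipni_ids = {"Dypsis lutea": "urn:lsid:ipni.org:names:12345-1"}

out = json.loads(reconcile('{"q0": {"query": "Dypsis lutea"}}', ipni_ids))
print(out["q0"]["result"][0]["id"])
```

A real service would sit behind HTTP and score fuzzy candidates rather than doing exact dictionary hits, but the payload shapes are what Refine expects.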
25. Proposal: link names in floras to IPNI
We’ll set up the tool with IPNI as its backend dataset.
We run lists of taxa treated in floras against it and distribute IPNI IDs for these names.
Short-term gain: navigate via the IPNI ID to the evidence about the name – protologues (Rod has matched 120K to DOIs) and types.
Long-term gain: GSPC target #1 – online world flora. Simpler to integrate data if we’re talking about the same name.
26. Proposal – link IPNI to types
We set up the tool with a botanical specimen catalogue as its backend data source.
We link up the IPNI cited type data with the specimens themselves.
27. Proposal – link floras to specimens
Floras use herbarium specimens as evidence for their distribution statements.
We set up the tool with a botanical specimen catalogue as its backend data source.
We extract specimen references from floras and run these against the tool to create links from flora accounts to the specimens themselves.
30. Proposal – link duplicates between herbaria
We set up the tool with a botanical specimen catalogue, e.g. K, as its backend data source.
We fire specimen data from another specimen catalogue at it to look for duplicates.
Benefits:
- Geo-referencing
- Imaging
- Data capture efficiency
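One plausible way to implement this duplicate matching is blocking on collection events: specimens from two catalogues are candidate duplicates when collector, collector number, and collection year agree after light normalisation. The field names and records below are illustrative, not the catalogues' actual schemas.

```python
# Sketch of duplicate detection between herbarium catalogues: build a
# normalised key per specimen, index the backend catalogue, then probe it
# with incoming records. Records and field names are illustrative.

def key(specimen):
    """Normalised blocking key: (collector, collector number, year)."""
    return (specimen["collector"].strip().lower(),
            specimen["number"].strip(),
            specimen["date"][:4])  # year only, from ISO dates

def find_duplicates(backend, incoming):
    """Pair each incoming specimen with backend specimens sharing its key."""
    index = {}
    for s in backend:
        index.setdefault(key(s), []).append(s)
    return [(s, match) for s in incoming for match in index.get(key(s), [])]

kew = [{"id": "K000525802", "collector": "Dransfield, J.",
        "number": "7654", "date": "2008-03-14"}]
other = [{"id": "P01234567", "collector": "dransfield, j.",
          "number": "7654", "date": "2008-03-14"}]
pairs = find_duplicates(kew, other)
print(pairs[0][1]["id"])  # K000525802
```

Candidate pairs found this way would still need review (or fuzzier scoring on locality and date), but even this coarse key delivers the benefits listed above: shared geo-referencing, shared imaging, and less repeated data capture.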