Kew at the pro-iBiosphere data hackathon

Published in: Technology

  1. 1. Kew at pro-iBiosphere data hackathon Nicky Nicolson, Matt Blissett RBG Kew Biodiversity Informatics team
  2. 2. A map + data + tools = links Two minute background: what we’ve done, why we should link up our data What is needed? - Persistent identifiers - Tools – to turn “strings” into “things” What we’ve brought along: - Map - Data - ... Labelled with persistent identifiers - A rules based matching / linking tool
  4. 4. specimens.kew.org/herbarium/K000525802 doi: 10.1007/s12225-010-9210-7
  5. 5. Cited in: Rakotoarinivo M, Dransfield J. 2010 New species of Dypsis and Ravenea (Arecaceae) from Madagascar. Kew Bull. 65, 279–303. doi:10.1007/s12225-010-9210-7 specimens.kew.org/herbarium/K000525802
  6. 6. Data linking tool: rules based. Armed with a tabular dataset, you: (a) define zero or more transformers for each field; (b) define how fields must match. This is a match configuration.
  7. 7. Examples of transformers: Epithet: mediterraneum → mediterranea; NormaliseDiacrits: Déségl. → Desegl.; RemoveBracketedText, RomanNumeral: cix (1892), 57 → 109 57; CleanedPubAuthors: (L.) A.Gray in Hook.f. → A.Gray; SurnameExtracter: (A.Gray) A.Heller → (Gray) Heller; PageExtractor: 37(4): 412 (1977) → 412
  8. 8. Examples of matchers: Exact; CommonTokens; CapitalLetters (in Beitr. Aethiop. → B A; Beitr. Fl. Aethiop. → B F A; = 0.67 ratio); Number; Integer; Levenshtein
  9. 9. Using the matcher A configured match can run against any tabular dataset. Accessible as: - JSON web service - Google Refine reconciliation service (work in progress) Transformers can be dropped into Google Refine
  10. 10. Proposal: link names in floras to IPNI We’ll set up the tool with IPNI as its backend dataset We run lists of taxa treated in floras against it and distribute IPNI IDs for these names. Short term gain: navigate via the IPNI ID to the evidence about the name – protologues (Rod has matched 120K to DOIs) and types. Long term gain: GSPC target #1 – online world flora. Simpler to integrate data if we’re talking about the same name.
  11. 11. Proposal – link IPNI to types We set up the tool with a botanical specimen catalogue as its backend data-source. We link up the IPNI cited type data with the specimens themselves.
  12. 12. Proposal – link floras to specimens Floras use herbarium specimens as evidence for their distribution statements. We set up the tool with a botanical specimen catalogue as its backend data-source. We extract specimen references from floras and run these against the tool to create links from flora accounts to specimens themselves.
  13. 13. specimens.kew.org/herbarium/K000049118
  14. 14. Cited in: FZ volume:5 part:3 (2003) Rubiaceae by D.M.Bridson & B.Verdcourt specimens.kew.org/herbarium/K000049118
  15. 15. Proposal – link duplicates between herbaria We set up the tool with a botanical specimen catalogue e.g. K as its backend data-source. We fire specimen data from another specimen catalogue at it to look for duplicates. Benefits: - Geo-referencing - Imaging - Data capture efficiency
  16. 16. n.nicolson@kew.org @nickynicolson m.blissett@kew.org
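
The match configuration described on slides 6 to 8 (per-field transformers plus a rule for how each field must match) can be illustrated with a minimal Python sketch. This is not the Kew tool's code or API: every function name, the `FieldRule` class, the field names, and the 0.6 threshold are hypothetical, chosen only to reproduce the examples shown on the slides. In particular, the ratio is assumed to be shared capital letters divided by the longer list, which happens to give the 0.67 on slide 8.

```python
import re
import unicodedata
from collections import Counter
from dataclasses import dataclass
from typing import Callable

# --- Transformers: str -> str (hypothetical, modelled on slide 7) ---
def normalise_diacritics(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def remove_bracketed_text(text: str) -> str:
    """Drop anything in round brackets, e.g. a year."""
    return re.sub(r"\s*\([^)]*\)", "", text)

ROMAN = re.compile(
    r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})", re.IGNORECASE
)
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(numeral: str) -> int:
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (ix = 9).
        smaller_before_larger = i + 1 < len(values) and value < values[i + 1]
        total += -value if smaller_before_larger else value
    return total

def roman_numerals_to_integers(text: str) -> str:
    """Simplification: any token that parses as a Roman numeral is converted."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return " ".join(str(roman_to_int(t)) if ROMAN.fullmatch(t) else t for t in tokens)

# --- Matchers (hypothetical, modelled on slide 8) ---
def exact(a: str, b: str) -> bool:
    return a == b

def capital_letter_ratio(a: str, b: str) -> float:
    """Assumed definition: shared capital letters / length of the longer list."""
    caps_a, caps_b = re.findall(r"[A-Z]", a), re.findall(r"[A-Z]", b)
    longest = max(len(caps_a), len(caps_b))
    if longest == 0:
        return 0.0
    shared = sum((Counter(caps_a) & Counter(caps_b)).values())
    return shared / longest

def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

# --- Match configuration: transformers and a matcher per field ---
@dataclass
class FieldRule:
    transformers: list[Callable[[str], str]]
    matcher: Callable[[str, str], bool]

CONFIG = {
    "author": FieldRule([normalise_diacritics], exact),
    "publication": FieldRule([], lambda a, b: capital_letter_ratio(a, b) >= 0.6),
    "collation": FieldRule([remove_bracketed_text, roman_numerals_to_integers], exact),
}

def records_match(config: dict, query: dict, candidate: dict) -> bool:
    """Transform both sides of every configured field, then apply its matcher."""
    for field, rule in config.items():
        left, right = query[field], candidate[field]
        for transform in rule.transformers:
            left, right = transform(left), transform(right)
        if not rule.matcher(left, right):
            return False
    return True

QUERY = {"author": "Déségl.", "publication": "in Beitr. Aethiop.",
         "collation": "cix (1892), 57"}
CANDIDATE = {"author": "Desegl.", "publication": "Beitr. Fl. Aethiop.",
             "collation": "109 57"}
print(records_match(CONFIG, QUERY, CANDIDATE))
```

Keeping transformers as plain string-to-string functions means the same configuration can be run against any tabular dataset, which is the property slide 9 relies on.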
