IRMNG presentation March 2012
  • Talk prepared for GN-CoL names and taxonomy sharing workshop, Hawaii, March 2012
  • The hierarchical approach contrasts with e.g. NameBank and GNI. Names are not accepted without a parent (even if, in a few cases, this is “Animalia (awaiting allocation)”). Placeholder groups, e.g. “Mollusca (awaiting allocation)”, are erected at Order and Family level so that genus names not yet placed to family can still be added (important for homonymy in particular, and because other details, e.g. publication info or extant/fossil status, may already be available).
  • Homonymy is a big problem: up to 15% of all genus names are homonyms/isonyms either within or across Codes (including some misspellings which collide with other “good” names). (Isonyms: multiple publication instances of the same name as new, based on the same type or concept.) Many genus names are valid under more than one Code (e.g. used in botany and zoology for different taxa), and a handful are concurrently valid under three Codes, as in this example: Lawsonia. The worst example currently is “Wagneria”: 14 instances in IRMNG, 2 valid, the rest synonyms. These cannot be disentangled without a master list of genus names.
  • IRMNG is a central aggregation point for all such information as is readily available from multiple sources, both electronic and print, although the compilation of names and associated nomenclatural information outstrips the full taxonomic information at this time. Incorporation of “TAXAMATCH” fuzzy matching also permits return of other names differing only slightly from the entered name, in case one of these is in fact the intended target (and also permits a degree of data cleaning and reconciliation/deduplication).
  • The IRMNG web query interface also includes information on extant & habitat flags, synonymy (as held), sources of the data, information about parent and child taxa, and so on.
  • Currently IRMNG is structured around Linnaean ranks only, i.e. kingdom / phylum / class / order / family / genus / species (no infraspecies are held at this time); this may be extended in future. Deprecated records (e.g. duplicates detected during subsequent QA) are left on the system with their IRMNG ID intact, in case they are referred to elsewhere or require re-activation. Records are flagged as either current (valid) or non-current at the indicated rank; it is not yet clear how to handle taxa considered non-current at the designated rank but current at another.
  • - Cat. of Life misses many genus-level synonyms / misspellings recognised elsewhere (including its source DB’s) - Genera not treated as distinct data objects in CoL (unless changed recently) i.e. no authorities, publication info, nomenclatural or taxonomic remarks - Coverage of fossils is considered valuable feature of IRMNG (though no systematic attempt at species ingestion as yet)
  • Many single sources of taxon names - often not integrated - newly published names discoverable only with some effort (although “official” registries/lists for prokaryotes, viruses) - considerable latency as names flow from published (at left) to aggregators (at right)
  • GNI / NameBank approach: collect as many namestrings as possible, any rank - User needs to explore source/s to determine taxonomic hierarchy and other information (if held) - Or: maybe one day, will be offered in a coherent hierarchy/list (but not any time soon)
  • GNI produces a (partial?) list of known orthographies (a mix of all ranks). Species and below can generally be eliminated by pattern matching, leaving uninomial names, i.e. genera and above (plus authorities), in multiple potential variants. Note that this suggests there may be 3 genuinely distinct “Lawsonia” instances known to GNI at this time, although sometimes the situation is more opaque or potentially misleading (similar auth’s but different taxa, different auth’s but the same taxon, or no auth held).
  • Cat. of Life approach: stitch together authoritative lists for global sectors complete to species - Some sectors (30% of all extant taxa) not yet sourced, may have no lists - Information above species level is sketchy (e.g. no genus, family auth’s or other information) - Fossil taxa are omitted at this time
  • Catalogue of Life largely indexes species and infraspecies; genera are presented last, with no authorities (although position in the hierarchy can be accessed from this page). Note that only 2 “Lawsonia”s are held; at least one more exists that is not known to CoL (either missing, or out of scope, i.e. fossil).
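
The note on GNI above says that species and below can generally be eliminated by pattern matching, leaving uninomial names plus authorities. A minimal sketch of that idea in Python follows. It is not GNI's or IRMNG's actual code: the regular expression, the function names and the sample strings (including the placeholder authorities) are all assumptions made for illustration. The heuristic treats a lower-case second word as a species epithet, so it will wrongly reject authorities that start in lower case (e.g. "de ...") and does not handle diacritics or hyphens.

```python
import re
from collections import defaultdict

# Heuristic (an assumption, not GNI's rule): one capitalised word, optionally
# followed by an authority that does NOT begin with a lower-case word.
UNINOMIAL = re.compile(r"([A-Z][a-z]+)(?: (?![a-z]+\b)(.+))?")


def split_uninomial(name_string):
    """Return (name, authority) for a uninomial name string, else None."""
    cleaned = " ".join(name_string.split())  # collapse runs of whitespace
    match = UNINOMIAL.fullmatch(cleaned)
    if match is None:
        return None  # binomial, trinomial, or not name-like at all
    return match.group(1), match.group(2) or ""


def group_uninomials(name_strings):
    """Map each uninomial name to the set of authority variants seen for it."""
    grouped = defaultdict(set)
    for raw in name_strings:
        parsed = split_uninomial(raw)
        if parsed is not None:
            name, authority = parsed
            grouped[name].add(authority)
    return dict(grouped)


# Invented sample input; "Author A" and "Author B" are placeholders.
SAMPLE = [
    "Lawsonia",
    "Lawsonia L.",
    "Lawsonia Author A, 1900",
    "Lawsonia Author B et al., 2000",
    "Lawsonia inermis L.",       # binomial: dropped
    "Lawsonia intracellularis",  # binomial: dropped
    "lawsonia",                  # not capitalised: dropped
    "Lythraceae",                # uninomial above genus rank: kept
]

if __name__ == "__main__":
    for name, authorities in sorted(group_uninomials(SAMPLE).items()):
        print(name, sorted(authorities))
```

Grouping by name only narrows the candidates. As the note says, similar authorities can hide different taxa and different authorities can refer to the same taxon, so the original nomenclatural sources are still needed to decide how many distinct genera are involved.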

Transcript

  • 1. IRMNG – the Interim Register of Marine and Nonmarine Genera: rationale and current status. Tony Rees – CSIRO Marine and Atmospheric Research, Australia. For: GN-CoL names and taxonomy sharing workshop, Hawaii, March 2012. www.obis.org.au/irmng
  • 2. The Dream… Imagine a system that would…
      • Automatically classify “any” genus & species name to kingdom / phylum / class / order / family (as far down as possible), i.e. “what is this critter”, plus hierarchical relations e.g. parents / children / siblings
      • Return whether a name is current (valid) or non-current, e.g. a synonym
      • Check spelling for correctness, also authority details, plus supply original publication ref. as available
      • Return associated attributes such as extant / fossil status, habitat information, geographic / geologic range, more…
      • Work seamlessly, with a single point of entry, across all groups and geologic epochs including present day
      • Be as up-to-date as possible (latest content), and authoritative (maintained by relevant experts)
    Tony Rees: IRMNG March 2012
  • 3. Realising the Dream…
      • For extant taxa: role of Cat. of Life, however ~30% of species still to go; for fossil taxa: PaleoDB (unknown proportion missing, maybe 50%?)
      • In the meantime, could make progress by assembling a global genera list, and infilling with species names as available (diagram labels: genera, species)
      • IRMNG is an attempt along these lines… a work in progress, with modest resourcing, but available for use now.
  • 4. IRMNG data sources
      • Animal genera + auth’s from Nomenclator Zoologicus and elsewhere; tax. placements and synonymies from multiple sources including CoL, individual taxon treatments and printed works
      • Botanical genera and auth’s from Index Nominum Genericorum (ING) supplemented with other sources; tax. placements and synonymies from multiple sources including GRIN (APGIII in the main), Index Fungorum, AlgaeBase, CyanoDB, more
      • Prokaryote genera, auth’s and tax. placements from LPSN (Euzéby list); previous/non-valid names from multiple sources
      • Virus genera and tax. placements from ICTV db (multiple versions, very different through time)
      • Species lists (all groups) from CoL 2006, Aphia/WoRMS 2006, AFD, NZ Organisms Register + more.
  • 5. IRMNG content as at March 2012 (cf. e.g. Cat. of Life):
      • Cat. of Life (2011 version): 8k families; 178k genera; 2.25m species names (including synonyms)
      • IRMNG: 19k families; 454k genera; 1.46m species names (including synonyms)
      • Not all IRMNG genera yet linked to relevant families, but ~370k are (remainder linked to higher taxon i.e. phylum, class or order)
      • Extant/fossil, marine/nonmarine flags held for majority of names
      • Nomenclatural status known for most names; tax. status i.e. valid name/synonym for only a subset at this time (varies by group)
      • Authority known for >97% of genera, publication details for “animal” subset (from Nomenclator Zoologicus in the main)
      • Fuzzy matching (TAXAMATCH) deployed over all web-based queries for correction of potential errors in input names to be matched.
  • 6. IRMNG in practice: example genus = “Lawsonia”
      • Same name is currently a valid genus in 3 Codes i.e. plants, animals and bacteria (no barriers to this)
  • 7. Required base information is scattered in multiple systems / printed works at this time (examples shown: plant, animal, bacterium, etc.)
  • 8. Required base information is scattered in multiple systems / printed works at this time (examples shown: plant, animal, bacterium, etc.)
  • 9. IRMNG query as at March 2012
  • 10. IRMNG query as at March 2012 (screenshot callouts: parents; children; synonym of; extant, habitat flags as known)
  • 11. Note: IRMNG fields displayed on the web are only a subset of full information held for any name, e.g.:
  • 12. IRMNG core fields
      • IRMNG ID, Rank
      • Scientific name (for species: epithet + parent ID)
      • Authority
      • Publication (as “microcitation”; subset with link to refs. module)
      • Source(s) for above
      • Orthography verified against (authoritative source)
      • Parent ID (+ “according to…”); Linnaean ranks only at this time
      • Nomenclatural status (+ relation with other names as needed) + “according to…”
      • Taxonomic status (same)
      • Nomenclatural Code
      • Taxonomic or nomenclatural remarks
      • Extant/fossil, marine/nonmarine flags + “according to” (could be “as per parent”)
      • Date entered, last modified, deprecated (where required)
    (under consideration…)
      • Intermediate ranks e.g. subfamily, subgenus, also infraspecies (not currently held)
      • Type genus / species indicator
      • Freshwater / terrestrial flags vs. present “nonmarine”
      • Geo flags (country codes etc.)
      • Palaeo range (periods/epochs)
      • Vernacular names as available
  • 13. IRMNG is not just a “passive” aggregator… Editorial / curatorial decisions / actions are required to:
      • Correct obvious data errors
      • Assemble “complete” records from multiple sources (where one source is data deficient)
      • Normalise authority data (in particular) to a “house style”
      • Digitise or transcribe print material into electronic form where not otherwise available
      • Decide between conflicting content in data sources e.g. for authority orthography/year, taxonomic placement, valid/synonym status and more
      • Cross-link names e.g. synonyms -> current names, basionyms -> replacement names, misspelled names to their correctly spelled counterparts, etc.
      • Reconcile variant higher taxonomies as supplied to a single hierarchy
      • Add nomenclatural or taxonomic remarks as required.
  • 14. Relevance to present meeting?
      • Demonstrates utility of a single entry point to a system permitting query on “any name”, i.e. a [comprehensive] Taxonomic Name Resolution Service (TNRS) covering all life
      • Envisage something like OBIS or GBIF, but for taxonomy: the aggregator / central query point is not a content author, but provides integration and value-added services
      • IRMNG is based on static snapshot/s of multiple data sources; cf. a “super catalogue”, which should be based on live feeds from relevant authoritative sources, continuously updated as available (? + some static data not available as feeds)
      • Maybe the static data lives outside the “data aggregation/query” point and becomes a separately managed source
      • How does / should GNA facilitate this?
      • Will the need for an IRMNG (or IRMNG equivalent) disappear or grow in the above scenario? (For example, could this role be taken by another player or group of players…)
  • 15. Thank you!
  • 16. (supplementary slides)
  • 17. Size of the task: IRMNG 2011 content cf. Cat. of Life 2011

      Rank                Cat. of Life  % with   IRMNG Oct 2011   % with   IRMNG Oct 2011
                          2011 edition  auths    extant + fossil  auths    fossil only
      Kingdoms            8                      7                         (0)
      Phyla               111                    153                       (12)
      Classes             288                    509                       (64)
      Orders              1,233                  2,645                     (715)
      Families            8,071         0%       19,639           22.1%    (6,542)
      Subfamilies         -             -        -                -        -
      Genera              178,515       0%       452,848          97.1%    (90,278)
      Subgenera           -             -        -                -        -
      Species (valid)     1,347,224     ~100%    1,020,519        ~100%    (16,792)
      Species (synonyms)  895,441       ~100%    440,738          ~100%    (100)

      • CoL has 70% of valid extant species names (of est. 1.9m total), thus maybe also 70% of valid extant genera (with subset of genus-level synonyms)
      • IRMNG has a further ~180k extant genus names and ~90k fossil names at this time (including syns); est. ~25k still missing
  • 18. Taxonomic names: what the customer is currently offered (+ more…) [Diagram. Column headings: publication; discovery; official registers; taxon-specific DB’s; integrated DB’s; “all names” DB’s. Row labels: Botany; Zoology. Items shown: new names published (in primary literature); journal TOC’s, RSS feeds, text mining; abstracting services; subject bibliographies; reviews, secondary literature; ICTV Viruses DB; CyanoDB; LPSN (Prokaryote names); Index Fungorum; MycoBank; AlgaeBase; Plant GSD’s; ICBN Decisions; The Plant List, IPNI, TROPICOS, ING; GNUB; PaleoDB; Animal GSD’s; Nomenclator Zoologicus; ION (Index of Organism Names); Zoological Record; ICZN Decisions; ITIS; NCBI Taxonomy; WoRMS etc.; ChecklistBank; Catalogue of Life; GNI; other compilations e.g. regional lists, Wikispecies, Wikipedia, more…]
  • 19. Two approaches - GNI and Cat. of Life
      • NameBank / GNI: 20m+ names, all ranks, no hierarchy; mix of “clean” and “dirty” names; many duplicates; extant + fossil, most sectors with at least some names
  • 20. GNI search result: “Lawsonia” (all ranks returned) (Mar 2012)
      • …candidate genus names highlighted in red (although could be other ranks too)
      • …need access to original taxonomic / nomenclatural resources to sort out / see if anything missed
  • 21. Two approaches - GNI and Cat. of Life
      • NameBank / GNI: 20m+ names, all ranks, no hierarchy; mix of “clean” and “dirty” names; many duplicates; extant + fossil, most sectors with at least some names
      • Cat. of Life: <2m names, Linnaean ranks, in hierarchy; all “clean” / vetted names / relationships; extant only, sectors either complete or absent
  • 22. Cat. of Life search result: “Lawsonia” (Mar 2012)
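
Slides 6 and 12 describe genus records that carry a nomenclatural Code, a parent and a current/non-current status, and note that the same name can be valid under several Codes at once. The sketch below shows one way such records could be modelled and checked for homonyms in Python. It is an illustration under stated assumptions, not IRMNG's schema: the class, the field names, the record IDs and the sample rows are hypothetical, and the placeholder parents only imitate the “(awaiting allocation)” convention described in the notes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenusRecord:
    record_id: int          # hypothetical stand-in for an IRMNG ID
    name: str
    code: str               # nomenclatural Code, e.g. "botanical"
    parent: str             # every record needs a parent, even a placeholder
    is_current: bool = True  # taxonomic status: current (valid) or non-current
    authority: Optional[str] = None


def homonym_groups(records):
    """Return {name: [records]} for names held more than once."""
    by_name = defaultdict(list)
    for record in records:
        by_name[record.name].append(record)
    return {name: recs for name, recs in by_name.items() if len(recs) > 1}


def codes_with_current_use(group):
    """List the Codes under which a name has at least one current record."""
    return sorted({record.code for record in group if record.is_current})


# Invented sample rows; IDs and placeholder parents are illustrative only.
SAMPLE = [
    GenusRecord(1, "Lawsonia", "botanical", "Lythraceae", authority="L."),
    GenusRecord(2, "Lawsonia", "zoological", "Animalia (awaiting allocation)"),
    GenusRecord(3, "Lawsonia", "bacteriological", "Bacteria (awaiting allocation)"),
    GenusRecord(4, "Examplea", "zoological", "Animalia (awaiting allocation)"),
]

if __name__ == "__main__":
    for name, group in homonym_groups(SAMPLE).items():
        print(name, len(group), codes_with_current_use(group))
```

Keeping the Code and the status on every record is what lets a query separate "same name, different Codes, all current" from "same name, one current plus synonyms", which a flat list of name strings cannot do.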