Sherborn: Lyal - Digitising legacy taxonomic literature: processes, products and using the output

708 views

Published on

To date, most digitisation of taxonomic literature has led to a more or less simple digital copy of a paper original – the output has effectively been an electronic copy of a traditional library. While this has increased accessibility of publications through internet access, for many scientific papers the means of indexing and locating them is much the same as with traditional libraries. OCR and born-digital papers allow use of web search engines to locate instances of taxon names and other terms, but OCR efficiency in recognising names is still relatively poor, people’s ability to use search engines effectively is mixed, and many papers cannot be directly searched. Instead of building digital analogues of traditional publications, we should consider what properties we require of future taxonomic information access. Ideally the content of each new digital publication should be accessible in the context of all previous published data, and the user able to retrieve nomenclatural, taxonomic and other data / information in the form required without having to scan all of the original paper and extract target content manually. This opens the door to dynamic linking of new content with extant systems – automatic population and updating of taxonomic catalogues, ZooBank and faunal lists, all descriptions of a taxon and its children instantly accessible with a single search, comparison of classifications used in different publications, and so on. The means to do this is currently marking up content into XML, the more atomised the mark-up the greater the possibilities for data retrieval and integration. Mark-up requires XML that accommodates the required content elements and is interoperable with other XML schemas, and there are now several written to do this, particularly TaxPub, taxonX and taXMLit, the last of these being the most atomised. Building on earlier systems for mark-up of legacy literature ViBRANT is developing a new workflow and seeking to increase the automated component of the process. Manual and automatic data and information retrieval is demonstrated by projects such as INOTAXA and Plazi. As we move to creating and using taxonomic products through the power of the internet, we need to ensure the output, while satisfying the requirements of the Code, is fit for purpose in the future.

Published in: Education, Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
708
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Self - Introduction Will develop some of ideas discussed on Monday by myself, Nick King, Christoph Haeuser and others Check who was there Had intended this for students, so it may be rather simplistic for many of you
  • Sherborn: Lyal - Digitising legacy taxonomic literature: processes, products and using the output

    1. 1. Digitising legacy taxonomic literature: processes, products and using the output   Chris Lyal The Natural History Museum, London
    2. 2. What are we trying to achieve with digitisation? <ul><li>In the first place, </li></ul><ul><li>improving access; </li></ul><ul><li>improving security for originals; </li></ul><ul><li>commercial benefit. </li></ul><ul><li>Digital copies accessible through the internet a means of achieving these objectives </li></ul>
    3. 3. How to find digitised taxonomic information <ul><li>But: </li></ul><ul><li>How does one locate a digital item? </li></ul><ul><ul><li>multiple libraries, not all searchable through Google etc </li></ul></ul><ul><ul><li>search terms sometimes rigid (loss of benefit of library systems like Dewey and browsing) </li></ul></ul><ul><ul><li>unit size being searched for (volume cf author + date) </li></ul></ul><ul><ul><li>Creating digital versions of traditional libraries is a vital step but not maximally efficient </li></ul></ul>
    4. 4. How to find digitised taxonomic information <ul><li>First pass of solutions: </li></ul><ul><ul><li>Expose digital content to web searches </li></ul></ul><ul><ul><li>Enable searches for smaller entities </li></ul></ul><ul><ul><ul><li>Article </li></ul></ul></ul><ul><ul><ul><li>Text element (e.g. treatment) </li></ul></ul></ul><ul><ul><li>Enable searches for key index terms, e.g. author, taxon name </li></ul></ul><ul><ul><ul><li>Author ‘easy’ if publication author (part of metadata); </li></ul></ul></ul><ul><ul><ul><li>if ‘other author’ subject to OCR issues </li></ul></ul></ul><ul><ul><li>Taxon names also cause OCR issues </li></ul></ul>
    5. 5. How to find digitised taxonomic information <ul><li>OCR efficiency relatively poor (SI’s 99.995% rarely reached) </li></ul><ul><ul><li>Can be improved with some techniques (ABLE project) </li></ul></ul><ul><ul><li>Still may require interpretation (author name abbreviations, genus abbreviations etc) </li></ul></ul><ul><ul><li>(Born digital provides much more reliable search) </li></ul></ul>
    6. 7. What do we really want to build? <ul><li>We have been building forward from the past </li></ul><ul><ul><li>digital analogues of traditional publications </li></ul></ul><ul><ul><li>extraction of, e.g., specimen data </li></ul></ul><ul><li>not back from the future </li></ul><ul><ul><li>What properties do we require of future taxonomic information access? </li></ul></ul><ul><ul><li>How do we apply this to legacy literature? </li></ul></ul>
    7. 8. What do we really want to build? <ul><li>Search: </li></ul><ul><ul><li>Requires single search </li></ul></ul><ul><ul><li>Wide range of search terms </li></ul></ul><ul><ul><li>Simple / Boolean search </li></ul></ul><ul><li>Retrieval: </li></ul><ul><ul><li>Article </li></ul></ul><ul><ul><li>Subsets of taxonomic publications </li></ul></ul><ul><ul><li>All descriptions of a taxon and its children </li></ul></ul><ul><ul><li>Repurposable downloads </li></ul></ul><ul><ul><li>Excludes the stuff we don’t want </li></ul></ul>
    8. 9. What do we really want to build? <ul><li>(Some) subsets of Articles and Treatments: </li></ul><ul><ul><li>Hierarchy </li></ul></ul><ul><ul><li>Taxon name + author + date + nomenclatural/taxonomic act </li></ul></ul><ul><ul><li>Original description citation [name][author][date][reference] </li></ul></ul><ul><ul><li>Subsequent taxonomic / nomenclatural changes citation </li></ul></ul><ul><ul><li>Diagnosis, description </li></ul></ul><ul><ul><li>Biological associations </li></ul></ul><ul><ul><li>Specimen data </li></ul></ul><ul><ul><li>Character statements </li></ul></ul>
    9. 10. What do we really want to build? <ul><li>Ideally: </li></ul><ul><li>retrieve such data without manually scanning whole paper and retrieving required data by copying </li></ul>
    10. 11. What do we really want to build?
    11. 12. What do we really want to build? <ul><li>If we can do this, then: </li></ul><ul><li>dynamic linking of new content with extant systems: </li></ul><ul><ul><li>automatic population and updating of taxonomic catalogues, faunal lists, EoL etc </li></ul></ul><ul><ul><li>compare classifications in different publications </li></ul></ul><ul><li>Retrieval of normalised data for re-use </li></ul><ul><ul><li>(cf Endnote or Mendeley) </li></ul></ul><ul><li>Population of ZooBank </li></ul><ul><ul><li>Automatic assessment of availability </li></ul></ul>
    12. 14. What do we really want to build? (ZooBank) <ul><li>The publication is obtainable in numerous identical copies - metadata </li></ul><ul><li>Publication: If non-paper, deposited in at least 5 major publicly-accessible libraries - metadata </li></ul><ul><li>Publication: Not excluded by Article 9 – metadata </li></ul><ul><li>The name is: published using the Latin Alphabet - metadata </li></ul><ul><li>Name: in the case of species-group names, agrees in gender with the genus name – markup = algorithm </li></ul><ul><li>Name: in the case of family-group names, has a permitted ending – markup + list </li></ul><ul><li>Name: in the case of family-group names, has an ending appropriate for the rank given – markup + list + algorithm </li></ul><ul><li>Name: in the case of family-group names, is based on the genus name stated – markup + algorigthm </li></ul><ul><li>Name: not already registered – markup + ZooBank search </li></ul><ul><li>Name: contains more than one letter – markup + algorithm </li></ul><ul><li>Genus in which new species-group name is placed (if applicable) - markup </li></ul><ul><li>The name is not published as a synonym but as a valid name – markup </li></ul><ul><li>Valid genus name on which new family-group name is based - markup </li></ul><ul><li>Type species of new genus-group name (including original combination, author and date): markup </li></ul><ul><li>Description of taxon, or bibliographic reference to a description, is part of publication – markup + algorithm </li></ul>
    13. 15. What do we need to do? <ul><li>User-needs assessment: clarity on what to retrieve </li></ul><ul><li>Overview of necessary system </li></ul>
    14. 16. What do we need to do?
    15. 17. What do we need to do?
    16. 18. What do we need to do? <ul><li>Bibliography </li></ul><ul><li>Author / agent database with synonyms </li></ul><ul><li>Repositories for: </li></ul><ul><ul><li>Original texts (e.g. BHL) </li></ul></ul><ul><ul><li>marked-up texts </li></ul></ul><ul><li>XML markup schema(s) </li></ul><ul><ul><li>mark-up atomisation cf data retrieval and integration </li></ul></ul><ul><li>Links / interoperability between different systems </li></ul><ul><li>Nomenclator (ZooBank) </li></ul><ul><li>Taxonomic databases (linked to ZooBank) </li></ul><ul><li>Effective search system </li></ul>
    17. 19. What do we need to do? <ul><li>Bibliography (TL-2; ViBRANT bibliography of life; CiteBank) </li></ul><ul><ul><li>Library and taxonomic sector standards </li></ul></ul><ul><ul><li>Standard citations </li></ul></ul><ul><ul><li>Abbreviations </li></ul></ul><ul><ul><li>De-duplication </li></ul></ul><ul><ul><li>Location of resource </li></ul></ul><ul><ul><li>free </li></ul></ul><ul><ul><li>open-source </li></ul></ul>
    18. 20. Options, needs and current activities <ul><li>XML schemas: e.g., TaxPub, taxonX and taXMLit </li></ul><ul><li>ViBRANT: </li></ul><ul><ul><li>developing a new workflow for legacy literature </li></ul></ul><ul><ul><li>seeking to increase automated component process </li></ul></ul><ul><ul><li>developing workflow for new literature </li></ul></ul><ul><li>Manual and automatic data retrieval demonstrated by INOTAXA, Plazi </li></ul>
    19. 22. Options, needs and current activities <ul><li>ZooBank: </li></ul><ul><ul><li>need to consider the properties required to be part of a larger picture </li></ul></ul><ul><li>Taxonomic databases: </li></ul><ul><ul><li>fragmented with non-standard terminology and content; </li></ul></ul><ul><ul><li>Catalogue of Life not tailored to this particular vision; </li></ul></ul><ul><ul><li>Need to be standardised for content and properties to be part of a coherent system. </li></ul></ul>
    20. 23. <ul><li>Agree what we want to build </li></ul><ul><ul><li>And what we expect it to deliver </li></ul></ul><ul><li>Identify components </li></ul><ul><li>databases </li></ul><ul><li>data types </li></ul><ul><li>interoperability </li></ul>Prioritise the content

    ×