Biodiversity Heritage Library: A Conversation About A Collaborative Digitizing Project

Presentation for the Office of Strategic Initiatives (November 8, 2006) with Suzanne C. Pilsk

Transcript

  • 1. Biodiversity Heritage Library: A Conversation About A Collaborative Digitization Project
    • Suzanne C. Pilsk
    • Martin R. Kalfatovic
    • Smithsonian Institution Libraries
  • 2. Biodiversity
    • What is Biodiversity?
    • Genetic variability within species
    • Diversity of species
    • Ecosystems and landscapes
  • 3. Biodiversity
    • Wholesome food
    • Drinkable water
    • Breathable air
    • Stable climate for
      • Forestry
      • Agriculture
      • Fisheries
    • Waste decomposition
    • Bioremediation
    • Invasive species
    • Pest control
    • Ecotourism
    • Pharmaceuticals
    • Genomics
    • Proteomics
    • Bioengineering
    • Biotechnology
    • Molecular design
    • Imitating nature
    • Designer organisms
    • Renewable feedstocks
    • Envirofriendly
    • Manufacturing processes
  • 4. Taxonomic Literature
    • Over 250 years of systematic description of life
    • Systema naturae (10th ed., 1758) by Carl von Linné
  • 5. Taxonomic Literature
    • The cited half-life of publications in taxonomy is longer than in any other scientific discipline
    • The decay rate is longer than in any scientific discipline
    • - Macro-economic case for open access, Tom Moritz
  • 6. Taxonomic Impediment
    • Specimen collections
    • Databases
    • Publications
    • Observations
    • ‘Gray’ literature
    • Index cards
    • Field notebooks
  • 7. Taxonomic Impediment
    • Agatea violaris: type specimen from the U.S. National Herbarium (Smithsonian Institution), collected by the United States Exploring Expedition, 1838-1842
  • 8. Taxonomic Impediment
  • 9. Taxonomic Impediment
    • Specimen
    • Plate or other visual image
    • Taxonomic description
  • 10. Taxonomic Literature
    The essential requirements for accessing and utilising this global information are:
    • that there is access to information held in national/regional/global collections
    • that electronic data is efficiently captured and provided in useable form
    • that existing information held in literature and by current experts is made available electronically
    • that stability of scientific names of organisms, used to access this information, is promoted
    • - Darwin Declaration, 1998
  • 11. Taxonomic Impediment
    • Biologia Centrali-Americana. Edited by Frederick Ducane Godman and Osbert Salvin. London: Pub. for the editors by R. H. Porter, 1879-1915
  • 12. Digital Divide?
  • 13. Digital Divide?
    • Vishwas Chavan travels a lot. An informatician based at the National Chemical Laboratory in Pune, India, he collects data on what types of animal live where in India to enter into a biodiversity database … Much of the information Chavan seeks is in old, out-of-print tomes … To find them, Chavan has spent years trailing around libraries. He dreams of the day when books such as these are scanned and made available as digital files on the Internet.
    • - “Science in the Web Age: The Real Death of Print” by Andreas von Bubnoff, Nature 438, 550-552 (1 December 2005)
  • 14. Encyclopedia of Life
    • … imagine for a moment that all the diversity of the world were finally revealed and then described, say one page to a species. The description would contain the scientific name, a photograph or drawing, a brief diagnosis, and information of where the species is found. If published in conventional book form … this Great Encyclopedia of Life would occupy 60 meters of library shelf per million species … 100 million species of organisms … would extend through 6 kilometers of shelving …
    • - E.O. Wilson (1992)
  • 15. Biodiversity Heritage Library
    • 2003, Telluride. Encyclopaedia of Life meeting
    • February 2005, London. Library and Laboratory: the Marriage of Research, Data and Taxonomic Literature
    • May 2005, Washington. Ground work for the Biodiversity Heritage Library
    • June 2006, Washington. Organizational and Technical meeting
    • October 2006, St. Louis/San Francisco. Technical meetings
  • 16. Biodiversity Heritage Library
    • Museums
      • American Museum of Natural History (New York)
      • Field Museum (Chicago)
      • Natural History Museum (London)
      • Smithsonian Institution (Washington)
  • 17. Biodiversity Heritage Library
    • Botanical Gardens
      • Missouri Botanical Garden
      • New York Botanical Garden
      • Royal Botanic Garden, Kew
  • 18. Biodiversity Heritage Library
    • University Libraries
      • Botany Libraries, Harvard University
      • Ernst Meyer Library of the Museum of Comparative Zoology, Harvard University
  • 19. Biodiversity Heritage Library
    • Bioinformatics Member
      • Marine Biological Laboratory / Woods Hole Oceanographic Institution Library (MBL/WHOI)
      • uBio project of MBL/WHOI
  • 20. Biodiversity Heritage Library
    • Affiliated Partner: Internet Archive
  • 21. Biodiversity Heritage Library
  • 22. Biodiversity Heritage Library
    • Core literature pre-1923: 400,000 volumes (80 million pages)
    • All pre-1923: 600-750,000 volumes (120-150 million pages)
    • All literature: 1.4-1.6 million volumes (280-320 million pages)
  • 23. Biodiversity Heritage Library Mandates: Open Access: all content can be reused, repurposed, reformatted, sliced, diced, scraped, and ???
  • 24. Data Types
    • CR2: Raw camera files (IA)
    • JPEG 2000
    • JPEG (IA)
    • GIF (IA)
    • Thumbnail (IA)
    • Flippy Book (IA)
    • PDF
    • DjVu (IA)
  • 25. Data Types
    • OCR Text
      • Raw OCR Text
      • Structured OCR Text
      • OCR Text w/embedded Taxonomic Intelligence
      • Structured OCR w/embedded Taxonomic Intelligence
  • 26. BHL Portal Prototype
  • 27. Taxonomic Impediment
    • Specimen
    • Plate or other visual image
    • Taxonomic description
  • 28. View
  • 29. 9. Page View
  • 30.  
  • 31. 9. Page View
  • 32.  
  • 33.  
  • 34. 9. Page View
  • 35. 10. Page View - Detail
  • 36. 11. Page View – Detail – Full Screen
  • 37. 12. Page View - Detail
  • 38.  
  • 39.  
  • 40.  
  • 41. 12. Page View - Detail
  • 42. Discover names
  • 43.  
  • 44. Names View
  • 45.  
  • 46. Names View
  • 47.  
  • 48.  
  • 49.  
  • 50. Names View
  • 51.  
  • 52.  
  • 53. Taxonomic Intelligence
  • 54. Taxonomic Intelligence
  • 55. Taxonomic Intelligence
  • 56. Taxonomic Intelligence
  • 57. Taxonomic Intelligence Vernacular terms Link outs
  • 58. Taxonomic Intelligence Generated Taxa Lists
  • 59. Taxonomic Intelligence
    • http://namebank.ubio.org/bulletin/process.php
  • 60. Biodiversity Heritage Library
    • Jacob Christian Schäffer, Elementa entomologica . . . 1766.
    • Metadata Repository: store all bibliographic metadata for the member libraries; create volume, part, piece metadata; ingest page-level metadata at scanning level for the creation of page-level Globally Unique Identifiers (GUIDs) for linking to other taxonomic services
  • 61. Preliminary First Steps
    • Combined metadata from member libraries = “Dirty Metadata Repository”
    • OCLC analysis
    • Worthwhile? Verdict still out
  • 62. Metadata Analysis
    • Initial analysis showed:
        • We have 1.3 million catalogue records
        • 73% are monographs (remainder are serials at title-level)
        • 63% is English language material. The next most popular language (9%) is German.
        • About 30% of material was published before 1923.
  • 63. Metadata Analysis
    • Record files were received from Smithsonian, MOBOT, NYBG, Kew, NHML, Harvard, and AMNH.
      • Total records: 1,330,058
    • From these files, all records describing language-based monographs were extracted (LDR/6 and LDR/7 equal to “a” and “m”, respectively).
      • Total records: 981,703
    • Assumed Serials
      • Total 256,962
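The monograph extraction described on this slide can be sketched in a few lines. This is a minimal illustration, assuming each record's 24-character MARC leader is available as a plain string; the function names and sample leaders are hypothetical and are not taken from the OCLC analysis itself.

```python
from typing import Iterable, List


def is_language_monograph(leader: str) -> bool:
    """True when LDR/06 is 'a' (language material) and LDR/07 is 'm' (monograph)."""
    return len(leader) >= 8 and leader[6] == "a" and leader[7] == "m"


def select_monographs(leaders: Iterable[str]) -> List[str]:
    """Keep only the leaders that describe language-based monographs."""
    return [ldr for ldr in leaders if is_language_monograph(ldr)]


# Hypothetical sample leaders: a book, a serial, and a printed map.
sample = [
    "00714cam a2200205 a 4500",  # LDR/06 = a, LDR/07 = m -> kept
    "01142cas a2200301 a 4500",  # LDR/07 = s (serial)     -> dropped
    "01832cem a2200385 a 4500",  # LDR/06 = e (map)        -> dropped
]
monographs = select_monographs(sample)
```

A filter like this is only as reliable as the leader coding in the source catalogs, so inconsistently coded records would be misclassified.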
  • 64. Metadata Analysis
    • 757,430 Total Monograph records made up of
      • 616,196 records with no matches (assumed unique)
      • 141,234 records representing a cluster
  • 65. Metadata Analysis
    • Overlap analysis
        • Of the 981,000 monograph records from all institutions, 378,000 matching pairs were found
        • 616,000 had no matches at all and were unique to one institution.
        • After de-duplication of the matching pairs, the final file contains 757,000 records .
  • 66. Metadata Analysis
    • 981,703 monograph records analyzed by OCLC’s duplicate detection software
      • 378,579 pairs detected and then clustered by A=B and B=C => A=C
    • 151,705 unique items
      • BUT the grand total came out too high (1,032,494, an increase of 50,791 over the 981,703 records analyzed) ~ Logic equation wasn’t quite right!
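The clustering rule on this slide (A=B and B=C => A=C) is transitive closure over matched pairs. The sketch below shows one common way to compute it, a union-find structure; it is an illustration under that assumption, not OCLC's duplicate detection software, and the record identifiers are made up.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def cluster_pairs(record_ids: Iterable[Hashable],
                  pairs: Iterable[Tuple[Hashable, Hashable]]) -> List[list]:
    """Group records so that A=B and B=C puts A, B and C in one cluster."""
    ids = list(record_ids)
    parent: Dict[Hashable, Hashable] = {r: r for r in ids}

    def find(x):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[Hashable, list] = {}
    for r in ids:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())


def summarize(clusters: List[list]) -> Tuple[int, int, int]:
    """Return (records with no match, multi-record clusters, de-duplicated total)."""
    singletons = sum(1 for c in clusters if len(c) == 1)
    multi = len(clusters) - singletons
    return singletons, multi, len(clusters)


clusters = cluster_pairs(["A", "B", "C", "D", "E", "F"],
                         [("A", "B"), ("B", "C"), ("D", "E")])
unmatched, clustered, total = summarize(clusters)
```

Because every input record lands in exactly one cluster, the cluster sizes always sum to the input count and the de-duplicated total can never exceed it. A check of that kind would flag a grand total larger than the number of records analyzed.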
  • 67. Metadata Analysis
    • Problems Problems Problems
      • Natural History Museum, London: the fixed-field coding on which OCLC based its monograph vs. serial title match was not “consistent”
      • Harvard catalog contained quite a few “monograph” records for library-specific analyzed bound articles
  • 68. Metadata Analysis
    • Serials! Guesstimate!
      • 60 million pages (300,000 volumes of 200 pages each)
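The serials guesstimate is a volumes-times-pages calculation, and the same 200-pages-per-volume ratio reproduces the earlier pre-1923 figure (400,000 volumes, 80 million pages). A small sketch, with a hypothetical function name, makes the assumption explicit:

```python
PAGES_PER_VOLUME = 200  # the average assumed on the slide


def estimated_pages(volumes: int, pages_per_volume: int = PAGES_PER_VOLUME) -> int:
    """Estimate the page count for a number of volumes at a fixed average length."""
    return volumes * pages_per_volume


serial_pages = estimated_pages(300_000)        # serials guesstimate
core_pre_1923_pages = estimated_pages(400_000)  # core pre-1923 literature
print(f"{serial_pages:,} serial pages; {core_pre_1923_pages:,} core pre-1923 pages")
```

Changing `PAGES_PER_VOLUME` shows how sensitive the totals are to the assumed average volume length.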
  • 69. Outline / Workflow
    • Scanning centers
      • 10 scanners in a pod
        • REQUIRES food at approximately XXX volumes per YYY
          • Boston
          • NYC area
          • DC
          • London
      • Single Scanning Station
  • 70. Outline / Workflow
    • 10 Natural History Libraries Scanning at Once
    • Who is to Scan What?
      • OCLC analysis assist in prioritizing
      • Collection Managers’
      • Gross general themes to begin
      • No longer worried about “Registry of Intent to Scan”
  • 71. Outline / Workflow
    • Volumes are pulled and taken to scanner
    • Scanner operator wands the barcode and uses Z39.50 to fetch a title-level record from the ILS
    • Problem
    • Multivolumes and Serials!
    • Title level descriptions – BUT – No item level metadata
  • 72. Problem: Issue-ization
    • Page scan data
    • Title level data
    • Missing is the in between – Citation resolving
    • CCS – some success but NOT open source
    • Citeseer – Lee Giles at PSU
  • 73. Outline / Workflow
    • “Clean Metadata Repository”
      • Title Level
      • Intellectual Units to Some Granularity
      • URL pointing to BHL “portal”
      • Identifiers registered somewhere
        • LSIDs
        • DOIs
        • BHL uniquely defined
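One way to picture the record structure listed on this slide is a title-level record holding intellectual units, each with a portal URL and a set of identifiers. The sketch below is purely illustrative: the class names, field names, `portal.example.org` address, and the use of a random UUID as the "BHL uniquely defined" identifier are all assumptions, not the project's actual schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IntellectualUnit:
    """A volume, part, or piece below the title level (hypothetical shape)."""
    label: str
    portal_url: str
    # Keys might be "lsid", "doi", or a locally defined scheme.
    identifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class TitleRecord:
    """Title-level description with its intellectual units."""
    title: str
    units: List[IntellectualUnit] = field(default_factory=list)


def make_unit(label: str, portal_base: str) -> IntellectualUnit:
    """Create a unit with a locally minted identifier and a portal URL built from it."""
    local_id = str(uuid.uuid4())  # stand-in for a locally defined identifier
    return IntellectualUnit(
        label=label,
        portal_url=f"{portal_base}/item/{local_id}",
        identifiers={"local": local_id},
    )


record = TitleRecord(title="Biologia Centrali-Americana")
record.units.append(make_unit("v. 1", "https://portal.example.org"))
```

Keeping identifiers in a dictionary leaves room to register an LSID or DOI for a unit later without changing the record shape.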
  • 74. Outline / Workflow
    • Clean Metadata Repository as a Source
      • For OCLC to pull and point
      • For local ILSs to pull and point
      • For NSDL and other harvesters
  • 75. BHL Metadata Repository
    • [Diagram: Internet Archive, BHL MR, BHL Public Interface, and Taxonomic Web Services (e.g. CBOL, GBIF, ITIS, GenBank, INOTAXA documents, etc.)]
  • 76. Timeline
    • BHL Metadata Repository for currently scanned titles: January 2007
    • BHL Portal for existing literature: March 2007
    • Funding for Mass Scanning: Late Spring 2007?
  • 77. Biodiversity Heritage Library
  • 78. Biodiversity Heritage Library
  • 79. Biodiversity Heritage Library
  • 80. Biodiversity Heritage Library
  • 81. Biodiversity Heritage Library: A Conversation About A Collaborative Digitization Project
    • Suzanne C. Pilsk
    • Martin R. Kalfatovic
    • Smithsonian Institution Libraries
    • Thanks to the following for input/content:
      • Chris Freeland (Missouri Botanical Garden)
      • Neil Thomson (Natural History Museum, London)
      • Anna Weitzman (National Museum of Natural History)
      • Chris Lyal (Natural History Museum, London)
      • Scott Miller (Smithsonian Institution)
  • 82. Conversation About a Collaborative Digitization Project http://www.sil.si.edu/staff/2006-BHL4LC/
    • Biodiversity Heritage Library (BHL): http://www.bhl.si.edu
    • Universal Biological Indexer and Organizer (uBio): http://www.ubio.org/
    • Consortium for the Barcode of Life (CBOL): http://barcoding.si.edu/
    • Global Biodiversity Information Facility (GBIF): http://www.gbif.org/
    • Taxonomic Databases Working Group (TDWG): http://www.nhm.ac.uk/hosted_sites/tdwg/