Biodiversity Heritage Library: A Conversation About A Collaborative Digitization Project

Presentation for the Office of Strategic Initiatives (November 8, 2006) with Suzanne C. Pilsk

Published in: Technology, Education
  1. Biodiversity Heritage Library: A Conversation About A Collaborative Digitization Project
      Suzanne C. Pilsk, Martin R. Kalfatovic
      Smithsonian Institution Libraries
  2. Biodiversity
      • What is Biodiversity?
      • Genetic variability within species
      • Diversity of species
      • Ecosystems and landscapes
  3. Biodiversity
      • Wholesome food
      • Drinkable water
      • Breathable air
      • Stable climate for
          • Forestry
          • Agriculture
          • Fisheries
      • Waste decomposition
      • Bioremediation
      • Invasive species
      • Pest control
      • Ecotourism
      • Pharmaceuticals
      • Genomics
      • Proteomics
      • Bioengineering
      • Biotechnology
      • Molecular design
      • Imitating nature
      • Designer organisms
      • Renewable feedstocks
      • Envirofriendly
      • Manufacturing processes
  4. Taxonomic Literature
      • Over 250 years of systematic description of life
      • Systema naturae (10th ed., 1758) by Carl von Linné
  5. Taxonomic Literature
      The cited half-life of publications in taxonomy is longer than in any other scientific discipline
      * * *
      The decay rate is longer than in any scientific discipline
      - Macro-economic case for open access, Tom Moritz
  6. Taxonomic Impediment
      • Specimen collections
      • Databases
      • Publications
      • Observations
      • ‘Gray’ literature
      • Index cards
      • Field notebooks
  7. Taxonomic Impediment
      Agatea violaris. Type specimen from the U.S. National Herbarium (Smithsonian Institution), collected by the United States Exploring Expedition, 1838-1842
  8. Taxonomic Impediment
  9. Taxonomic Impediment
      • Specimen
      • Plate or other visual image
      • Taxonomic description
  10. Taxonomic Literature
      The essential requirements for accessing and utilising this global information are:
      • that there is access to information held in national/regional/global collections
      • that electronic data is efficiently captured and provided in useable form
      • that existing information held in literature and by current experts is made available electronically
      • that stability of scientific names of organisms, used to access this information, is promoted
      - Darwin Declaration, 1998
  11. Taxonomic Impediment
      Biologia Centrali-Americana. Edited by Frederick Ducane Godman and Osbert Salvin. London: Pub. for the editors by R. H. Porter, 1879-1915
  12. Digital Divide?
  13. Digital Divide?
      Vishwas Chavan travels a lot. An informatician based at the National Chemical Laboratory in Pune, India, he collects data on what types of animal live where in India to enter into a biodiversity database … Much of the information Chavan seeks is in old, out-of-print tomes … To find them, Chavan has spent years trailing around libraries. He dreams of the day when books such as these are scanned and made available as digital files on the Internet.
      “Science in the Web Age: The Real Death of Print” by Andreas von Bubnoff, Nature 438, 550-552, 1 December 2005
  14. Encyclopedia of Life
      … imagine for a moment that all the diversity of the world were finally revealed and then described, say one page to a species. The description would contain the scientific name, a photograph or drawing, a brief diagnosis, and information of where the species is found. If published in conventional book form … this Great Encyclopedia of Life would occupy 60 meters of library shelf per million species … 100 million species of organisms … would extend through 6 kilometers of shelving …
      E.O. Wilson (1992)
  15. Biodiversity Heritage Library
      • 2003, Telluride. Encyclopaedia of Life meeting
      • February 2005, London. Library and Laboratory: the Marriage of Research, Data and Taxonomic Literature
      • May 2005, Washington. Ground work for the Biodiversity Heritage Library
      • June 2006, Washington. Organizational and Technical meeting
      • October 2006, St. Louis/San Francisco. Technical meetings
  16. Biodiversity Heritage Library
      • Museums
          • American Museum of Natural History (New York)
          • Field Museum (Chicago)
          • Natural History Museum (London)
          • Smithsonian Institution (Washington)
  17. Biodiversity Heritage Library
      • Botanical Gardens
          • Missouri Botanical Garden
          • New York Botanical Garden
          • Royal Botanic Gardens, Kew
  18. Biodiversity Heritage Library
      • University Libraries
          • Botany Libraries, Harvard University
          • Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
  19. Biodiversity Heritage Library
      • Bioinformatics Member
          • Marine Biological Laboratory / Woods Hole Oceanographic Institution Library (MBL/WHOI)
          • uBio project of MBL/WHOI
  20. Biodiversity Heritage Library
      • Affiliated Partner: Internet Archive
  21. Biodiversity Heritage Library
  22. Biodiversity Heritage Library
      • Core literature pre-1923: 400,000 (80 million pages)
      • All pre-1923: 600-750,000 (120-150 million pages)
      • All literature: 1.4-1.6 million (280-320 million pages)
  23. Biodiversity Heritage Library
      Mandates: Open Access: all content can be reused, repurposed, reformatted, sliced, diced, scraped, and ???
  24. Data Types
      • CR2: Raw camera files (IA)
      • JPEG 2000
      • JPEG (IA)
      • GIF (IA)
      • Thumbnail (IA)
      • Flippy Book (IA)
      • PDF
      • DjVu (IA)
  25. Data Types
      • OCR Text
          • Raw OCR Text
          • Structured OCR Text
          • OCR Text w/embedded Taxonomic Intelligence
          • Structured OCR w/embedded Taxonomic Intelligence
  26. BHL Portal Prototype
  27. Taxonomic Impediment
      • Specimen
      • Plate or other visual image
      • Taxonomic description
  28. View
  29. 9. Page View
  31. 9. Page View
  34. 9. Page View
  35. 10. Page View - Detail
  36. 11. Page View - Detail - Full Screen
  37. 12. Page View - Detail
  41. 12. Page View - Detail
  42. Discover names
  44. Names View
  46. Names View
  50. Names View
  53. Taxonomic Intelligence
  54. Taxonomic Intelligence
  55. Taxonomic Intelligence
  56. Taxonomic Intelligence
  57. Taxonomic Intelligence: Vernacular terms, Link outs
  58. Taxonomic Intelligence: Generated Taxa Lists
  59. Taxonomic Intelligence
      • http://namebank.ubio.org/bulletin/process.php
  60. Biodiversity Heritage Library
      [Image: Jacob Christian Schäffer, Elementa entomologica . . . 1766.]
      Metadata Repository: Store all bibliographic metadata for the member libraries; create volume, part, piece metadata; ingest page level metadata at scanning level for the creation of page level Globally Unique Identifiers (GUIDs) for linking to other taxonomic services
  61. Preliminary First Steps
      • Combined metadata from member libraries = “Dirty Metadata Repository”
      • OCLC analysis
      • Worthwhile? Verdict still out
  62. Metadata Analysis
      • Initial analysis showed:
          • We have 1.3 million catalogue records
          • 73% are monographs (remainder are serials at title-level)
          • 63% is English language material. The next most popular language (9%) is German.
          • About 30% of material was published before 1923.
  63. Metadata Analysis
      • Record files were received from Smithsonian, MOBOT, NYBG, Kew, NHML, Harvard, and AMNH.
          • Total records: 1,330,058
      • From these files, all records describing language-based monographs were extracted (LDR/6 and LDR/7 equal to “a” and “m”, respectively).
          • Total records: 981,703
      • Assumed Serials
          • Total 256,962
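
The monograph extraction described on the slide above can be sketched in a few lines. This is a minimal illustration, not the tooling OCLC or the BHL partners used: it assumes each record's MARC leader is already available as a plain string, and the function names and sample leaders are hypothetical. In MARC 21, leader position 06 value "a" means language material and position 07 value "m" means monograph.

```python
def is_language_monograph(leader: str) -> bool:
    """True when LDR/06 is 'a' (language material) and LDR/07 is 'm' (monograph)."""
    return len(leader) > 7 and leader[6] == "a" and leader[7] == "m"


def split_records(leaders):
    """Split leaders into (language monographs, everything else)."""
    monographs = [ldr for ldr in leaders if is_language_monograph(ldr)]
    others = [ldr for ldr in leaders if not is_language_monograph(ldr)]
    return monographs, others


# Hypothetical sample leaders, made up for illustration only.
sample_leaders = [
    "00714cam a2200205 a 4500",  # LDR/06 = 'a', LDR/07 = 'm': language monograph
    "01142cas a2200301 a 4500",  # LDR/07 = 's': serial
    "00925cem a2200241 a 4500",  # LDR/06 = 'e': cartographic material
]

monographs, others = split_records(sample_leaders)
print(len(monographs), len(others))
```

A filter like this is only as good as the fixed-field coding in the source catalogs, which is one of the problems the later slides report.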
  64. Metadata Analysis
      • 757,430 Total Monograph records made up of
          • 616,196 records with no matches (assumed unique)
          • 141,234 records representing a cluster
  65. Metadata Analysis
      • Overlap analysis
          • Of the 981,000 monograph records from all institutions, 378,000 matching pairs were found.
          • 616,000 had no matches at all and were unique to one institution.
          • After de-duplication of the matching pairs, the final file contains 757,000 records.
  66. Metadata Analysis
      • 981,703 monograph records analyzed by OCLC’s duplicate detection software
          • 378,579 pairs detected and then clustered by A=B and B=C => A=C
      • 151,705 unique items
          • BUT the grand total came out too high (1,032,494, an increase of 50,791): the logic equation wasn’t quite right!
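
The transitive rule on the slide above (A=B and B=C => A=C) is the standard way to turn matching pairs into clusters. The sketch below uses a union-find structure to do that; it is an illustration of the general technique under that assumption, not OCLC's duplicate detection software, and the record IDs are invented.

```python
def cluster_pairs(pairs):
    """Group record IDs into clusters, treating each pair as 'same item'
    and applying transitivity (A=B and B=C implies A=C)."""
    parent = {}

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return list(clusters.values())


# Hypothetical record IDs: A, B, and C collapse into one cluster; D and E into another.
example = cluster_pairs([("A", "B"), ("B", "C"), ("D", "E")])
print(sorted(sorted(c) for c in example))
```

If each cluster is then reduced to a single record, the de-duplicated total is the number of unmatched records plus the number of clusters, which matches the slide 64 figures: 616,196 + 141,234 = 757,430.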
  67. Metadata Analysis
      • Problems Problems Problems
          • Natural History Museum (London): the fixed-field coding that OCLC used for the monograph vs. serial title-based match was not “consistent”
          • Harvard catalog contained quite a few “monograph” records for analyzed, library-specific bound articles
  68. Metadata Analysis
      • Serials! Guesstimate!
          • 60 million pages (300,000 volumes of 200 pages each)
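
The serials guesstimate above is a simple volumes-times-pages calculation. The short sketch below reproduces it and shows that the page counts on slide 22 follow from the same 200-pages-per-volume figure; the function and constant names are illustrative only, and applying the figure to slide 22 is an inference from the numbers rather than something the slides state.

```python
PAGES_PER_VOLUME = 200  # figure stated in the serials guesstimate


def estimated_pages(volumes: int, pages_per_volume: int = PAGES_PER_VOLUME) -> int:
    """Rough page count for a given number of volumes."""
    return volumes * pages_per_volume


print(estimated_pages(300_000))    # serials guesstimate
print(estimated_pages(400_000))    # core pre-1923 literature (slide 22)
print(estimated_pages(1_400_000))  # low end of "all literature" (slide 22)
```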
  69. Outline / Workflow
      • Scanning centers
          • 10 scanners in a pod
              • REQUIRES food at approximately XXX volumes per YYY
                  • Boston
                  • NYC area
                  • DC
                  • London
          • Single Scanning Station
  70. Outline / Workflow
      • 10 Natural History Libraries Scanning at Once
      • Who is to Scan What?
          • OCLC analysis assists in prioritizing
          • Collection Managers’
          • Gross general themes to begin
          • No longer worried about “Registry of Intent to Scan”
  71. Outline / Workflow
      • Volumes are pulled and taken to scanner
      • Scanner wands the barcode and uses Z39.50 to fetch a title-level record from the ILS
      • Problem
      • Multivolumes and Serials!
      • Title level descriptions - BUT - No item level metadata
  72. Problem: Issue-ization
      • Page scan data
      • Title level data
      • Missing is the in between - Citation resolving
      • CCS - some success but NOT open source
      • Citeseer - Lee Giles at PSU
  73. Outline / Workflow
      • “Clean Metadata Repository”
          • Title Level
          • Intellectual Units to Some Granularity
          • URL pointing to BHL “portal”
          • Identifiers registered somewhere
              • LSIDs
              • DOIs
              • BHL uniquely defined
  74. Outline / Workflow
      • Clean Metadata Repository as a Source
          • For OCLC to pull and point
          • For local ILS’ to pull and point
          • For NSDL and other harvesters
  75. BHL Metadata Repository
      [Diagram: Internet Archive; BHL MR; BHL Public Interface; Taxonomic Web Services, e.g. CBOL, GBIF, ITIS, GenBank, INOTAXA documents, etc.]
  76. Timeline
      • BHL Metadata Repository for currently scanned titles: January 2007
      • BHL Portal for existing literature: March 2007
      • Funding for Mass Scanning: Late Spring 2007?
  77. Biodiversity Heritage Library
  78. Biodiversity Heritage Library
  79. Biodiversity Heritage Library
  80. Biodiversity Heritage Library
  81. Biodiversity Heritage Library: A Conversation About A Collaborative Digitization Project
      Suzanne C. Pilsk, Martin R. Kalfatovic
      Smithsonian Institution Libraries
      Thanks to the following for input/content:
      • Chris Freeland (Missouri Botanical Garden)
      • Neil Thomson (Natural History Museum, London)
      • Anna Weitzman (National Museum of Natural History)
      • Chris Lyal (Natural History Museum, London)
      • Scott Miller (Smithsonian Institution)
  82. Conversation About a Collaborative Digitization Project
      http://www.sil.si.edu/staff/2006-BHL4LC/
      • Biodiversity Heritage Library (BHL)
          http://www.bhl.si.edu
      • Universal Biological Indexer and Organizer (UBio)
          http://www.ubio.org/
      • Consortium for the Barcode of Life (CBOL)
          http://barcoding.si.edu/
      • Global Biodiversity Information Facility (GBIF)
          http://barcoding.si.edu/
      • Taxonomic Databases Working Group (TDWG)
          http://www.nhm.ac.uk/hosted_sites/tdwg/