The Biodiversity Heritage Library Mass Digitizing Project: A Grandeur in this View of Digital Libraries


The Biodiversity Heritage Library Mass Digitizing Project: A Grandeur in this View of Digital Libraries by Martin R. Kalfatovic and Suzanne C. Pilsk, Smithsonian Institution Libraries. LITA National Forum, October 2007. Denver, Colorado.


  1. 1. <ul><ul><li>A Grandeur in this View of Digital Libraries </li></ul></ul><ul><ul><li>Martin R. Kalfatovic </li></ul></ul><ul><ul><li>Suzanne C. Pilsk </li></ul></ul><ul><ul><li>Smithsonian Institution Libraries </li></ul></ul><ul><ul><li>LITA Forum </li></ul></ul><ul><ul><li>6 October 2007 </li></ul></ul><ul><ul><li>Denver, Colorado </li></ul></ul>
  2. 2. There is grandeur in this view of life, with its several powers, having been originally breathed into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved. Charles Darwin, The Origin of Species, 1859
  3. 3. <ul><li>Boutique Scanning </li></ul><ul><li>Scanning back cameras </li></ul><ul><li>All special handling </li></ul><ul><li>Slow production </li></ul><ul><li>Expensive </li></ul>
  4. 4. <ul><li>Rare Books </li></ul><ul><li>Obvious choice </li></ul><ul><li>“Pretty pictures” </li></ul>
  5. 5. <ul><li>Rare Books </li></ul><ul><li>Rarely had textual access (OCR, etc.) </li></ul><ul><li>Difficult to link to other materials </li></ul>
  6. 6. <ul><li>Mass Scanning Projects </li></ul><ul><li>Google Books </li></ul><ul><li>MSN Live </li></ul><ul><li>Internet Archive/Open Content Alliance </li></ul>
  7. 8. <ul><li>Difficult (impossible?) to repurpose much of the material </li></ul><ul><li>Quality of images often questionable </li></ul><ul><li>“Frankenbooks” </li></ul><ul><li>Sketchy / inaccurate bibliographic data </li></ul>
  8. 10. <ul><li>Content is more re-purposable than Google </li></ul><ul><li>Content not fully open </li></ul><ul><li>Nice search interface </li></ul><ul><li>Still, no context! </li></ul>
  9. 12. <ul><li>Open Access </li></ul><ul><li>Hodge-podge of collections </li></ul><ul><li>Interface hard to use! </li></ul><ul><li>Still, no, or very little context </li></ul>
  10. 15. <ul><li>United States Exploring Expedition </li></ul><ul><li>Biologia Centrali-Americana </li></ul>
  11. 17. <ul><li>Biologia Centrali-Americana </li></ul><ul><ul><li>Large scale content and repurposing text </li></ul></ul><ul><ul><li>Innotaxa </li></ul></ul>
  12. 21. <ul><li>Our Man Sherborn – the Squire </li></ul><ul><li>Cataloger at heart </li></ul><ul><li>Slowly went through every relevant text looking for names </li></ul><ul><li>Created an index that was useful as soon as he started </li></ul><ul><li>Index AND Bibliography of relevant texts from 1758 through 1850 </li></ul>
  13. 24. <ul><li>Re-keying of data </li></ul><ul><li>Parsing data </li></ul><ul><li>Using tools to harvest out data </li></ul><ul><li>Manually matching data </li></ul><ul><li>Wealth of information locked on the page is being liberated! </li></ul>
  14. 27. <ul><li>Current Uses </li></ul><ul><ul><li>Look up individual species </li></ul></ul><ul><ul><li>Batch whole data sets for comparisons </li></ul></ul><ul><li>Future Possibilities </li></ul><ul><ul><li>Updated species information </li></ul></ul><ul><ul><li>Accurate images of species </li></ul></ul><ul><ul><li>Geographic distributions </li></ul></ul><ul><ul><li>Other databases can link (such as uBio) </li></ul></ul><ul><ul><li>Future prioritizing of scans </li></ul></ul>
  15. 28. <ul><li>Demonstration I: Index Animalium and Nomenclator Zoologicus </li></ul><ul><li>http:// / </li></ul>
  16. 29. <ul><li>Nomina si nescis, perit et cognitio rerum </li></ul><ul><li>Who knoweth not the name, knoweth not the subject </li></ul><ul><li>~ Linnaeus, 1737, Critica Botanica n. 210 </li></ul>
  17. 30. <ul><li>Over 250 years of systematic description of life </li></ul><ul><li>Systema naturae (10th ed. 1758) by Carl von Linné </li></ul>
  18. 31. <ul><li>Binomial Nomenclature </li></ul><ul><li>Genus name and Species epithet or descriptor </li></ul><ul><li>Latin or Latin-ized </li></ul><ul><li>Bill Gates' Flower Fly </li></ul><ul><li>Eristalis gatesi Thompson </li></ul>
  19. 32. <ul><li>International Code of Botanical Nomenclature (St. Louis Code) </li></ul><ul><li>International Code of Zoological Nomenclature </li></ul><ul><li>International Code of Phylogenetic Nomenclature </li></ul>
  20. 33. <ul><li>Index Animalium’s Citation: </li></ul><ul><ul><li>albimanus Delphinus, T. R. Peale in Wilkes, Expl. Exped. VIII. 1848, 33 </li></ul></ul><ul><li>On page 33 of volume 8 of Charles Wilkes’s 1848 publication, Narrative of the United States Exploring Expedition, Delphinus albimanus was first named by T. R. Peale. </li></ul>
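The reading of the citation on this slide can be sketched in code. The following is a minimal illustration, assuming every entry has exactly the shape of the one example shown (epithet, genus, author, "in", abbreviated work, Roman-numeral volume, year, page). The pattern and the names `CITATION_RE`, `roman_to_int`, and `parse_citation` are hypothetical and are not taken from the Index Animalium project; real entries vary far more than this.

```python
import re

# Assumed pattern, fitted only to the single example entry on the slide.
CITATION_RE = re.compile(
    r"^(?P<epithet>[a-z-]+)\s+"      # species epithet is listed first
    r"(?P<genus>[A-Z][a-z]+),\s+"    # genus follows the epithet
    r"(?P<author>.+?)\s+in\s+"       # author who named the species
    r"(?P<work>.+?)\s+"              # abbreviated publication
    r"(?P<volume>[IVXLCDM]+)\.\s+"   # volume as a Roman numeral
    r"(?P<year>\d{4}),\s+"           # year of publication
    r"(?P<page>\d+)$"                # page on which the name appears
)

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'VIII' to an integer."""
    total = 0
    for current, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[current]
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        total += -value if ROMAN_VALUES.get(following, 0) > value else value
    return total


def parse_citation(entry: str) -> dict:
    """Split one index entry into its parts; raise ValueError if it does not fit."""
    match = CITATION_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"entry does not fit the assumed pattern: {entry!r}")
    parts = match.groupdict()
    return {
        "binomial": f"{parts['genus']} {parts['epithet']}",
        "author": parts["author"],
        "work": parts["work"],
        "volume": roman_to_int(parts["volume"]),
        "year": int(parts["year"]),
        "page": int(parts["page"]),
    }


parsed = parse_citation(
    "albimanus Delphinus, T. R. Peale in Wilkes, Expl. Exped. VIII. 1848, 33"
)
print(parsed)
```

Expanding the abbreviated work ("Wilkes, Expl. Exped.") to a full title would still need a separate lookup against a bibliography, which this sketch does not attempt.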
  21. 34. Agatea violaris Type specimen from the U.S. National Herbarium (Smithsonian Institution) collected by the United States Exploring Expedition, 1838-1842
  22. 36. <ul><li>Specimen </li></ul><ul><li>Plate or other visual image </li></ul><ul><li>Taxonomic description </li></ul>
  23. 38. The cited half-life of publications in taxonomy is longer than in any other scientific discipline * * * The decay rate is longer than in any scientific discipline ~ Macro-economic case for open access, Tom Moritz
  24. 39. <ul><li>Taxonomic descriptions must be published for the name to be valid </li></ul><ul><li>Publications must be available to the public through trusted sources </li></ul><ul><li>Libraries have been the traditional place </li></ul>
  25. 40. <ul><li>Specimen collections </li></ul><ul><li>Databases </li></ul><ul><li>Publications </li></ul><ul><li>Observations </li></ul><ul><li>‘Gray’ literature </li></ul><ul><li>Index cards </li></ul><ul><li>Field notebooks </li></ul>
  26. 41. Biologia Centrali-Americana. Edited by Frederick Ducane Godman and Osbert Salvin. London : Pub. for the editors by R. H. Porter, 1879-1915
  27. 43. Vishwas Chavan travels a lot. An informatician based at the National Chemical Laboratory in Pune, India, he collects data on what types of animal live where in India to enter into a biodiversity database … Much of the information Chavan seeks is in old, out-of-print tomes … To find them, Chavan has spent years trailing around libraries. He dreams of the day when books such as these are scanned and made available as digital files on the Internet. “Science in the Web Age: The Real Death of Print” by Andreas von Bubnoff, Nature 438, 550-552, 1 December 2005
  28. 47. <ul><li>What is Biodiversity? </li></ul><ul><li>Ecosystems and landscapes </li></ul><ul><li>Diversity of species </li></ul><ul><li>Genetic variability within species </li></ul>
  29. 52. <ul><li>Wholesome food </li></ul><ul><li>Drinkable water </li></ul><ul><li>Breathable air </li></ul><ul><li>Stable climate for </li></ul><ul><ul><li>Forestry </li></ul></ul><ul><ul><li>Agriculture </li></ul></ul><ul><ul><li>Fisheries </li></ul></ul><ul><li>Waste decomposition </li></ul><ul><li>Bioremediation </li></ul><ul><li>Invasive species </li></ul><ul><li>Pest control </li></ul><ul><li>Ecotourism </li></ul>
  30. 53. <ul><li>Pharmaceuticals </li></ul><ul><li>Genomics </li></ul><ul><li>Proteomics </li></ul><ul><li>Bioengineering </li></ul><ul><li>Biotechnology </li></ul><ul><li>Molecular design </li></ul><ul><li>Imitating nature </li></ul><ul><li>Designer organisms </li></ul><ul><li>Renewable feedstocks </li></ul><ul><li>Envirofriendly </li></ul><ul><li>Manufacturing processes </li></ul>
  31. 55. <ul><li>2003, Telluride. Encyclopedia of Life meeting </li></ul><ul><li>February 2005. London. Library and Laboratory: the Marriage of Research, Data and Taxonomic Literature </li></ul><ul><li>May 2005. Washington. Ground work for the Biodiversity Heritage Library </li></ul><ul><li>June 2006. Washington. Organizational and Technical meeting </li></ul><ul><li>August 2006. New York Botanical Garden. BHL Director’s Meeting. </li></ul><ul><li>October 2006. St. Louis/San Francisco. Technical meetings </li></ul><ul><li>February 2007. Museum of Comparative Zoology. Organizational meeting </li></ul><ul><li>May 2007. Encyclopedia of Life Launch. Washington DC. </li></ul><ul><li>September 2007. Missouri Botanical Garden. Technical and Organizational Meeting. St. Louis, Missouri. </li></ul>
  32. 56. <ul><li>American Museum of Natural History (New York) </li></ul><ul><li>Field Museum (Chicago) </li></ul><ul><li>Natural History Museum (London) </li></ul><ul><li>Smithsonian Institution (Washington) </li></ul><ul><li>Missouri Botanical Garden (St. Louis) </li></ul>
  33. 57. <ul><li>New York Botanical Garden (New York) </li></ul><ul><li>Royal Botanic Garden, Kew </li></ul><ul><li>Botany Libraries, Harvard University </li></ul><ul><li>Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University </li></ul><ul><li>Marine Biological Laboratory / Woods Hole Oceanographic Institution </li></ul>
  34. 58. <ul><li>Core literature pre-1923: 400,000 (80 million pages) </li></ul><ul><li>All pre-1923: 600-750,000 (120-150 million pages) </li></ul><ul><li>All literature: 1.4-1.6 million (280-320 million pages) </li></ul>
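The page estimates on this slide are consistent with a single multiplier. Reading the first figure on each line as a count of volumes (an assumption; the slide gives no unit), 80 million pages for 400,000 volumes implies about 200 pages per volume, and the other two lines follow from the same ratio. A short check, with `PAGES_PER_VOLUME` and `estimated_pages` as illustrative names only:

```python
# Assumption inferred from the slide: 80,000,000 pages / 400,000 volumes.
PAGES_PER_VOLUME = 200


def estimated_pages(volumes: int, pages_per_volume: int = PAGES_PER_VOLUME) -> int:
    """Estimate the page count for a given number of volumes."""
    return volumes * pages_per_volume


# (low, high) volume counts as given on the slide.
volume_ranges = {
    "Core literature pre-1923": (400_000, 400_000),
    "All pre-1923": (600_000, 750_000),
    "All literature": (1_400_000, 1_600_000),
}

for label, (low, high) in volume_ranges.items():
    low_pages = estimated_pages(low) // 1_000_000
    high_pages = estimated_pages(high) // 1_000_000
    print(f"{label}: {low_pages}-{high_pages} million pages")
```

Changing the assumed pages-per-volume figure scales all three estimates proportionally, which is worth keeping in mind when budgeting scanning capacity.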
  35. 59. <ul><li>Most literature is in </li></ul><ul><li>the developed world </li></ul><ul><li>the Northern Hemisphere </li></ul><ul><li>Most Biodiversity is in </li></ul><ul><li>developing world </li></ul><ul><li>the Southern Hemisphere </li></ul>
  36. 60. <ul><li>Most literature is </li></ul><ul><li>in large libraries formed in the 19th century </li></ul>
  37. 62. Who has what? What should we scan and when? Monographs vs Serials Series treated as separates Can it be found and used once scanned?
  38. 63. <ul><ul><li>Initial Metadata Analysis: </li></ul></ul><ul><ul><li>We have 1.3 million catalogue records </li></ul></ul><ul><ul><li>73% are monographs (remainder are serials at title-level) </li></ul></ul><ul><ul><li>63% is English language material. The next most popular language (9%) is German. </li></ul></ul><ul><ul><li>About 30% of material was published before 1923. </li></ul></ul>
  39. 66. <ul><li>Scalable Mass Scanning </li></ul><ul><li>Contracts </li></ul><ul><li>Firewalls </li></ul><ul><li>Bathrooms </li></ul><ul><li>Security </li></ul><ul><li>Loading docks </li></ul><ul><li>Trucks </li></ul>
  40. 70. <ul><li>Mass Scanning Workflow </li></ul><ul><li>Pick lists </li></ul><ul><li>Packing lists </li></ul><ul><li>Serials management </li></ul><ul><li>Monographic management </li></ul><ul><li>Stickers for books </li></ul>
  41. 72. Demonstration II: Workflow Tools
  42. 78. <ul><li>Stable URL </li></ul><ul><li>Handle </li></ul><ul><li>DOI </li></ul><ul><li>BICI/SICI </li></ul><ul><li>ISSN </li></ul><ul><li>ISBN </li></ul>
  43. 80. <ul><li>Biologia Centrali-Americana :zoology, botany and archaeology </li></ul><ul><ul><li>Mammalia </li></ul></ul><ul><ul><li>Aves v. 1 </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><ul><li>Aves v. 4 </li></ul></ul><ul><ul><li>Reptilia and Batrachia </li></ul></ul><ul><li>Biologia Centrali-Americana : Aves </li></ul><ul><ul><li>v. 1 Introduction -- Subclass Aves Carinate. Order Passeres. </li></ul></ul><ul><ul><li>v. 2 Subclass Aves Carinate. Order[s]: Passeres (contd.), Macrochires, Pici, Coccyges, Psitta </li></ul></ul>Is it Or
  44. 81. <ul><li>Zea mays L. </li></ul><ul><ul><li>Sp. Pl. 2: 971-972. 1753. </li></ul></ul><ul><li>Title: Species Plantarum </li></ul><ul><ul><ul><li>TL2: 4.769 </li></ul></ul></ul><ul><ul><ul><li>Tropicos Pub_id: 1071 </li></ul></ul></ul><ul><ul><ul><li>IPNI Pub_id:1071-2 </li></ul></ul></ul><ul><li>Volume: 2 </li></ul><ul><li>Start Page: 971 </li></ul><ul><li>End Page: 972 </li></ul><ul><li>Year Published: 1753 </li></ul>http:// http://
  45. 82. <ul><li>Page turning, multiple views, translations </li></ul><ul><li>PDF “Grab-n-Go” </li></ul>OCR Text Stellaria ongipes ? Goldie. Cornwallis Island. Silene acaWlis, L. Woman Islands. Sawfraga oppositifolia, L. Kakkidlarn, Greenland; Cornwallis Island. Sa4mfraga cerua, De Cand. Ukaari; Cornwallis Island. Sairaga csepitosa, L. Wolstenholme, Greenland; Cornwallis Island. Saxfraga rimdaris, De Cand. Whale Fish Island. Sad~ raga nivalis, L. Cornwallis Island. Ptentilla emarginata? Pursh. Wolstenholme. Dipena Lapponica, L. Whale Fish Island. Pyrola rotndifolia, L. Whale Fish Island. Casiope tetragona, Don. Ichauti. Vaccinium uliginosum, L. Loiseleuria procumbens, L. Whale Fish Island.
  46. 83. <ul><li><genus>Polygonum</genus> </li></ul><ul><li><species>viviparm</species>, </li></ul><ul><li><author>L.</author> </li></ul><ul><li><locality>Bushnan Island.</locality> </li></ul><ul><li>Created via automated or </li></ul><ul><li>semi-automated means. </li></ul>Machine-readable literature
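How such markup might be produced by automated means can be sketched as follows. This is an illustration only: it assumes each OCR line has the shape "Genus species, Author. Locality." seen in the sample text, and the `<record>` wrapper element, the pattern `LINE_RE`, and the function `tag_line` are hypothetical, not part of the BHL tooling. The species spelling is kept exactly as it appears on the slide.

```python
import re
import xml.etree.ElementTree as ET

# Assumed line shape: "Genus species, Author. Locality."
LINE_RE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+"          # capitalised genus
    r"(?P<species>\S+?)\s*,\s+"            # epithet, up to the comma
    r"(?P<author>[A-Z][A-Za-z ]*\.)\s+"    # abbreviated author, e.g. "L."
    r"(?P<locality>.+)$"                   # the rest is treated as locality
)


def tag_line(line: str) -> str:
    """Wrap the parts of one OCR line in the element names shown on the slide."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"line does not fit the assumed shape: {line!r}")
    record = ET.Element("record")  # hypothetical wrapper element
    for tag in ("genus", "species", "author", "locality"):
        ET.SubElement(record, tag).text = match.group(tag)
    return ET.tostring(record, encoding="unicode")


print(tag_line("Polygonum viviparm, L. Bushnan Island."))
```

Lines that do not fit the pattern raise an error rather than being tagged wrongly, which is where the semi-automated (human review) step would come in.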
  47. 87. <ul><li>10.3 million name strings in NameBank </li></ul><ul><li>Uses sophisticated algorithm (TaxonGrab) to locate likely name strings in OCR text </li></ul><ul><li>Iterative processing of BHL texts will both increase the number of name strings in NameBank and increase the accuracy of name string recognition </li></ul>
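The general idea of locating likely name strings in noisy OCR text can be illustrated with a much simpler technique than the one named on this slide. The sketch below is not TaxonGrab: it pulls out capitalised two-word candidates with a regular expression and then fuzzy-matches them against a tiny list of known names using Python's standard `difflib`. The list `KNOWN_NAMES` is a hypothetical stand-in for NameBank, and the 0.85 similarity cutoff is an arbitrary assumption.

```python
import difflib
import re

# Hypothetical stand-in for a name registry such as NameBank.
KNOWN_NAMES = [
    "Silene acaulis",
    "Vaccinium uliginosum",
    "Loiseleuria procumbens",
]

# Candidate binomial: a capitalised word followed by another word.
CANDIDATE_RE = re.compile(r"\b([A-Z][a-z]+)\s+([A-Za-z]{3,})\b")


def find_names(ocr_text: str, cutoff: float = 0.85) -> list[tuple[str, str]]:
    """Return (string found in OCR, closest known name) pairs."""
    results = []
    for genus, epithet in CANDIDATE_RE.findall(ocr_text):
        candidate = f"{genus} {epithet}"
        # Keep the candidate only if it is close to a known name.
        close = difflib.get_close_matches(candidate, KNOWN_NAMES, n=1, cutoff=cutoff)
        if close:
            results.append((candidate, close[0]))
    return results


sample = (
    "Silene acaWlis, L. Woman Islands. Vaccinium uliginosum, L. "
    "Loiseleuria procumbens, L. Whale Fish Island."
)
for found, matched in find_names(sample):
    print(f"{found!r} -> {matched!r}")
```

Place names such as "Woman Islands" also look like capitalised two-word candidates, so the comparison against known names is what filters them out; a larger registry makes that filter both more permissive and more useful.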
  48. 89. <ul><li>Demonstration III: Taxonomic Intelligence </li></ul>
  49. 92. <ul><li>Search </li></ul><ul><li>Browse </li></ul>
  50. 93. <ul><li>Demonstration IV: BHL Portal </li></ul>
  51. 95. <ul><li>Funding from the MacArthur and Sloan Foundations </li></ul><ul><li>Part of the larger Encyclopedia of Life project </li></ul>
  52. 97. Structure of the Encyclopedia of Life Serine Molecule
  53. 98. Serine Molecule Synthesis Center Field Museum Biodiversity Heritage Library Secretariat Smithsonian Education & Outreach Smithsonian/Harvard Informatics Marine Biological Laboratory & MOBOT
  54. 102. Let us … rejoice in the fact, that we have realised what no other kingdom can boast of, and that such vast and harmoniously related accumulation of knowledge is gathered together around a library Charles Darwin, et al, 1858
  55. 104. <ul><li>Thanks to: </li></ul><ul><ul><li>Chris Freeland, Missouri Botanical Garden </li></ul></ul><ul><ul><li>Neil Thomson, Natural History Museum, London </li></ul></ul><ul><ul><li>David Remsen, Global Biodiversity Information Facility </li></ul></ul><ul><ul><li>Neil Sarkar, Marine Biological Laboratory/Woods Hole Oceanographic Institution </li></ul></ul><ul><ul><li>Anna Weitzman, National Museum of Natural History </li></ul></ul><ul><ul><li>Chris Lyal, Natural History Museum, London </li></ul></ul><ul><ul><li>The staff at the Internet Archive </li></ul></ul><ul><li>Images from </li></ul><ul><ul><li>The Galaxy of Images, Smithsonian Libraries </li></ul></ul><ul><ul><li>NASA, Visible Earth Project </li></ul></ul><ul><ul><li>Martin R. Kalfatovic </li></ul></ul><ul><ul><li>Diane Rielinger </li></ul></ul>
  56. 105. <ul><li>Biodiversity Heritage Library </li></ul><ul><li>Encyclopedia of Life </li></ul><ul><li>Smithsonian Institution Libraries http:// / </li></ul><ul><li>Universal Biological Indexer and Organizer </li></ul><ul><li>Sherborn’s Index Animalium http:// / </li></ul><ul><li>Neave’s Nomenclator Zoologicus http:// / </li></ul><ul><li>United States Exploring Expedition http:// </li></ul><ul><li>Biologia Centrali-Americana </li></ul><ul><li>Botanicus http:// / </li></ul>