Biodiversity Heritage Library (BHL): Technology Overview
  • Chris Freeland
  • Director, Bioinformatics, Missouri Botanical Garden
  • Technical Director, Biodiversity Heritage Library
  • [email_address]
  • www.biodiversitylibrary.org
BHL Partners
  • Museums: American Museum of Natural History (New York); Natural History Museum (London); Smithsonian Institution (Washington); The Field Museum (Chicago)
  • Botanical Gardens: Missouri Botanical Garden; New York Botanical Garden; Royal Botanic Gardens, Kew
  • University Libraries: Botany Libraries, Harvard University; Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University; University of Illinois
  • Bioinformatics Institutes: MBL/WHOI uBio.org
Why have BHL?
  "In any well-appointed Natural History Library there should be found every book and every edition of every book dealing in the remotest way with the subjects concerned. One never knows wherein one edition differs from or supplements the other and unless these are on the same table at the same time it is not possible to collate them properly. Moreover for accurate work it is necessary for the student to verify every reference he may find; it is not enough to copy from a previous author; he must verify each reference itself from the original."
  Charles Davies Sherborn (1861-1942), Epilogue to Index Animalium, March 1922
Unique Components of BHL
  • Combining metadata records from multiple libraries (similar, but different) and representing them through a shared portal
  • Use of JPEG2000
  • Web 2.0 mashups
  • Taxonomic data mining
  • Services
  • Rare & novel content
Scanning process
  • Select book
  • Pull from shelf
  • Send to IA scanning center
  • Book is scanned and goes through QA
  • Page images loaded on IA cluster
  • Derivatives created
  • Book returned to library
  • Files harvested from IA portal
  • Books available for display within BHL portal
Mushrooms of America, edible and poisonous. Ed. by Julius A. Palmer, Jr., 1885.
Scan & Store: Internet Archive
  • Scanning on Scribes
  • Storage in Petaboxes
Scanning & Derivatives
  • File formats: XML, JP2, PDF, JPG, TXT, DjVu
  • Diagram labels: Master, Derivatives
Harvest from IA
  • Extract, Transform, Load (ETL)
  • Custom scripts to extract content via IA’s APIs
  • Database scripts to transform to relational data structure
  • Load into database
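The three ETL steps above can be sketched in a few lines. This is only an illustration, not the BHL scripts: the shape of SAMPLE_ITEM, the table and column names, the PDF file name, and the use of SQLite are all assumptions, and extract() is a stand-in for a real call to IA's APIs.

```python
import sqlite3

# Hypothetical record standing in for what an IA API call might return.
# The identifier and the .jp2 file name come from the slides; the PDF name
# and the overall shape of the record are assumptions.
SAMPLE_ITEM = {
    "identifier": "mushroomsofameri00palm",
    "title": "Mushrooms of America, edible and poisonous",
    "year": "1885",
    "files": ["mushroomsofameri00palm_0010.jp2", "mushroomsofameri00palm.pdf"],
}


def extract(source):
    """Extract step. A real script would call IA's APIs here."""
    return source


def transform(item):
    """Transform step: flatten the record into relational rows."""
    book = (item["identifier"], item["title"], int(item["year"]))
    pages = [
        (item["identifier"], name)
        for name in item["files"]
        if name.endswith(".jp2")  # keep only page images
    ]
    return book, pages


def load(conn, book, pages):
    """Load step: write the rows into two (assumed) tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book "
        "(identifier TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page (identifier TEXT, file_name TEXT)"
    )
    conn.execute("INSERT OR REPLACE INTO book VALUES (?, ?, ?)", book)
    conn.executemany("INSERT INTO page VALUES (?, ?)", pages)
    conn.commit()


def run_etl(source, conn):
    """Run extract, transform, and load in order."""
    book, pages = transform(extract(source))
    load(conn, book, pages)
```

Calling run_etl(SAMPLE_ITEM, sqlite3.connect(":memory:")) fills both tables in an in-memory database, which is a convenient way to try the flow without any network access.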
 
 
 
Portal features: Stable URL, Attribution, Name Finding, Page Turning, Zoom/Pan, Download/View, Browse, Search, Filter, Target/Object
JPEG2000 (*.jp2) display
  • RAW original => 85% .jp2
  • LuraTech encoder
  • Wavelet compression
  • LizardTech decoder
  • Tiled on the fly, cached for performance
  • GSIV browser-based client viewer
  • ‘AJAXian’
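"Tiled on the fly, cached for performance" can be pictured with a small sketch. It does not reproduce the LizardTech decoder or GSIV: the 256-pixel tile size, the function names, and the placeholder return value of get_tile are assumptions made purely for illustration.

```python
import math
from functools import lru_cache

TILE_SIZE = 256  # assumed tile edge in pixels


def tile_grid(width, height, tile_size=TILE_SIZE):
    """Return (columns, rows) of tiles needed to cover an image."""
    return math.ceil(width / tile_size), math.ceil(height / tile_size)


def tile_box(col, row, width, height, tile_size=TILE_SIZE):
    """Return the (left, top, right, bottom) pixel box of one tile.

    Tiles on the right and bottom edges are clipped to the image size.
    """
    left = col * tile_size
    top = row * tile_size
    right = min(left + tile_size, width)
    bottom = min(top + tile_size, height)
    return left, top, right, bottom


@lru_cache(maxsize=1024)
def get_tile(page_id, col, row, width, height):
    """Placeholder for decoding one tile from a .jp2 file.

    A real server would decode pixels here; this sketch returns only the
    tile's description so that the caching behaviour can be observed.
    """
    return page_id, tile_box(col, row, width, height)
```

With this layout a repeated request for the same tile is answered from the in-memory cache instead of being decoded again, which is the general idea behind caching tiles that many viewers request.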
Image delivery: BHL/IA architecture
  • A user requests Mushrooms of America, edible and poisonous, Plate X: http://www.biodiversitylibrary.org/page/1274907
  • Components: Browser (GSIV.js), www.biodiversitylibrary.org, BHLdb, IA, images.mobot.org, LizardTech ExpressServer
  • Request /page/1274907 carries pageid: 1274907
  • BHLdb is used to locate: http://www.archive.org/download/mushroomsofameri00palm/.../mushroomsofameri00palm_0010.jp2
  • Formats in the flow: .jp2, .jpg
  • BHL/IA architecture = 5.0+ sec transfer
  • Time to deliver image: 8+ sec
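The lookup step in this flow, from a stable page URL to the location of the .jp2 file, might look roughly like the sketch below. The dictionary PAGE_LOCATIONS is a hypothetical stand-in for the BHLdb query, the function names are invented for illustration, and the elided part of the IA URL is kept exactly as it appears on the slide.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the BHLdb lookup: page ID -> location of the .jp2.
# The "..." in the URL is elided in the source and is left as-is.
PAGE_LOCATIONS = {
    1274907: (
        "http://www.archive.org/download/mushroomsofameri00palm/"
        ".../mushroomsofameri00palm_0010.jp2"
    ),
}


def page_id_from_url(url):
    """Extract the numeric page ID from a URL of the form .../page/<id>."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "page" or not parts[1].isdigit():
        raise ValueError(f"not a page URL: {url}")
    return int(parts[1])


def locate_jp2(url, locations=PAGE_LOCATIONS):
    """Return the stored .jp2 location for a page URL."""
    return locations[page_id_from_url(url)]
```

For the example on the slide, locate_jp2("http://www.biodiversitylibrary.org/page/1274907") returns the archive.org location recorded for pageid 1274907.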
Reuse, don’t rebuild
Name Finding in action with Taxonomic Intelligence…
  • TIF image from scanner
  • Converted to text via PrimeOCR
  • Name finding via TaxonFinder
  • Extract names
  • Submit to NameBank
  • SOAP response
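To give a feel for the name-finding step, here is a deliberately naive sketch. It is not TaxonFinder and does not call NameBank: it matches capitalized-word plus lowercase-word pairs in OCR text and keeps those whose first word appears in a small, hypothetical genus list (KNOWN_GENERA).

```python
import re

# Hypothetical mini-lexicon; a real system would check a names database.
KNOWN_GENERA = {"Carcharodon", "Agaricus"}

# Candidate binomial: a capitalized word followed by a lowercase word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{2,})\b")


def find_names(text, genera=KNOWN_GENERA):
    """Return candidate scientific names found in text, in order, without repeats."""
    names = []
    for genus, epithet in BINOMIAL.findall(text):
        name = f"{genus} {epithet}"
        if genus in genera and name not in names:
            names.append(name)
    return names
```

The genus filter is what keeps ordinary phrases such as "The family" from being reported; without a lexicon of some kind, a pattern like this produces many false positives on running text.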
Names data mining
Tag cloud from LCSH
  • Subject heading from library catalog
  • Expressed as MARCXML
  • Tag cloud
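One way to go from MARCXML subject headings to tag-cloud weights is sketched below. The sample records, the choice of field 650 subfield a as the topical heading, and the font-size range are assumptions for illustration; the slides do not say which fields or scaling BHL uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Made-up sample records, for illustration only.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Mushrooms</subfield>
      <subfield code="z">United States</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Mushrooms</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Botany</subfield>
    </datafield>
  </record>
</collection>"""


def subject_counts(marcxml):
    """Count topical subject headings (field 650, subfield a)."""
    root = ET.fromstring(marcxml)
    counts = Counter()
    path = ".//marc:datafield[@tag='650']/marc:subfield[@code='a']"
    for subfield in root.findall(path, MARC_NS):
        if subfield.text:
            counts[subfield.text.strip().rstrip(".")] += 1
    return counts


def tag_sizes(counts, min_px=12, max_px=32):
    """Scale counts linearly to font sizes; the most frequent term gets max_px."""
    if not counts:
        return {}
    top = max(counts.values())
    return {
        term: min_px + (max_px - min_px) * n // top
        for term, n in counts.items()
    }
```

Passing the result of subject_counts(SAMPLE) to tag_sizes gives each heading a font size proportional to how often it occurs across the records.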
Geocoding LCSH
RSS Feeds
  • Specific: Last 25 books published in German from NYBG
    RSS Feed location: http://www.biodiversitylibrary.org/RecentRss/25/GER/NYBG
      • Allgemeine deutsche Garten-Zeitung, 7, 1829 (added: 04/03/2008)
      • Zeitschrift für wissenschaftliche Mikroskopie und für mikroskopische Technik. 2, 1885 (added: 03/28/2008)
      • Zeitschrift für technische Biologie. 7, 1919 (added: 03/27/2008)
      • …
  • General: Last 25 books from all libraries
    RSS Feed location: http://www.biodiversitylibrary.org/RecentRss/25
      • Summa plantarum: v.1 (added: 05/01/2008)
      • Vegetable materia medica of the United States (added: 04/30/2008)
      • The family herbal; (added: 04/30/2008)
      • …
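The two feed locations above suggest a URL pattern of count, then language code, then institution code. The helper below builds URLs on that inferred pattern; the function name is invented, and whether the service accepts other combinations (for example a language without an institution) is not stated in the slides.

```python
RSS_BASE = "http://www.biodiversitylibrary.org/RecentRss"


def recent_rss_url(count, language=None, institution=None):
    """Build a feed URL following the pattern seen in the two examples.

    The pattern /RecentRss/<count>[/<language>[/<institution>]] is inferred
    from those examples and is an assumption, not documented behaviour.
    """
    parts = [RSS_BASE, str(count)]
    if institution and not language:
        raise ValueError("the examples only show an institution after a language")
    if language:
        parts.append(language)
    if institution:
        parts.append(institution)
    return "/".join(parts)
```

recent_rss_url(25, "GER", "NYBG") and recent_rss_url(25) reproduce the specific and general feed locations listed on the slide.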
Services
  • Names v.1 released: http://www.biodiversitylibrary.org/services/name/NameService.asmx
  • Stable URLs:
      • http://www.biodiversitylibrary.org/bibliography/1652
      • http://www.biodiversitylibrary.org/name/Carcharodon_carcharias
  • Future: Citation Resolver, Titles Resolver
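The stable URL examples follow a simple shape that is easy to generate from an identifier or a name. The helpers below are a sketch based only on those two examples; the function names are invented, and replacing spaces with underscores is inferred from the Carcharodon_carcharias URL rather than from any published rule.

```python
BHL_BASE = "http://www.biodiversitylibrary.org"


def bibliography_url(title_id):
    """Stable URL for a title, following the /bibliography/<id> example."""
    return f"{BHL_BASE}/bibliography/{int(title_id)}"


def name_url(scientific_name):
    """Stable URL for a name; whitespace becomes underscores (inferred)."""
    return f"{BHL_BASE}/name/" + "_".join(scientific_name.split())
```

bibliography_url(1652) and name_url("Carcharodon carcharias") give back the two stable URLs shown on the slide.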
BHL Name Services http://www.biodiversitylibrary.org/services/name/NameService.asmx
Provider Integration
  • Encyclopedia of Life
  • Atrium Andes Biodiversity
  • Wikipedia
  • EDIT Scratchpads
  • More to come…
 
 
Hardware Infrastructure
  • Distributed
  • Partially redundant
  • Work needed
  • Mixed platforms
  • Mixed app frameworks
MOBOT Petabox cluster Internet Archive
 
File Storage Estimates
  • 4 MB per page, including derivatives
  • 1 million pages = 4 TB storage
  • Expected output: 60 – 100 million pages
  • 240 – 400 TB for files
  • 10 – 20 GB for db
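The file estimates follow directly from the 4 MB per page figure. The short sketch below reproduces the arithmetic, assuming decimal units (1 TB = 1,000,000 MB), which is what makes 1 million pages come out at exactly 4 TB; the function name is invented for illustration.

```python
MB_PER_PAGE = 4        # from the slide, including derivatives
MB_PER_TB = 1_000_000  # decimal units assumed


def storage_tb(pages, mb_per_page=MB_PER_PAGE):
    """Estimated file storage in terabytes for a given page count."""
    return pages * mb_per_page / MB_PER_TB


# Low and high ends of the expected output of 60 to 100 million pages.
LOW_TB = storage_tb(60_000_000)
HIGH_TB = storage_tb(100_000_000)
```

LOW_TB and HIGH_TB evaluate to 240.0 and 400.0, matching the 240 to 400 TB range given for files.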
Future Work
  • Services: Citation Resolver, Titles Resolver
  • Interfaces
  • Editing
  • Authoritative
  • Community
  • Backend
Fedora
  • Funded by Gordon and Betty Moore Foundation to adopt Fedora Commons
  • Working with Internet Archive to define use and practice
  • Project completion December 2009
Thank You
  • Chris Freeland, [email_address]
  • BHL Portal: www.biodiversitylibrary.org
  • BHL Blog: biodiversitylibrary.blogspot.com
  • BHL collection at Internet Archive: www.archive.org/details/biodiversity
