Hello! My name is Bianca Crowley and I am the Collections Coordinator for the Biodiversity Heritage Library. In the next 10 minutes, I will be talking to you about how the BHL incorporates hundreds of e-book materials into its online collection every week and all for free.
The Biodiversity Heritage Library (BHL) is a consortium of 12 natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL project launched its web portal in May of 2007. Now a collection of over 46,000 titles, 89,000 volumes and 33.2 million pages, the BHL has garnered a dedicated user base and is expanding globally with partner projects developing in Europe, China, Australia, Egypt and Brazil.
Libraries in the BHL consortium work with the Internet Archive to digitize the books from their respective collections. Materials are selected and de-duplicated to the best of our ability, and shipped on carts to the nearest scanning facility. Books are scanned on non-destructive scanners, called Scribes. At the time of scanning, descriptive metadata is passed from the contributing library’s catalog to populate records for the digitized works. Page images, automatically rendered OCR text, and metadata files are generated for each item and stored in the Internet Archive repository. The BHL then harvests the metadata, image, and text files and serves them via the BHL web portal. BHL adds new services on top of the content, such as uBio’s taxonomic name finding service, which algorithmically identifies the scientific names embedded within the OCR text. Finally, user feedback is collected and any requests for scanning inform the selection process.
The Internet Archive is an organization “offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format”. They work with other libraries and institutions to provide low-cost digitization and data storage services. The Internet Archive administers the Open Content Alliance, “a collaborative effort of…organizations from around the world [that work together to] build a permanent archive of multilingual digitized text and multimedia material.” IA’s corpus of digitized material is, in large part, completely free and open for the taking, making for a veritable sea of e-books to choose from for integration into your own collection.
Likewise, BHL content is completely free to access, download, and re-use. Materials in the collection are either 1) in the public domain, having been published prior to 1923, 2) identified as texts without copyright renewal, or 3) permitted for digitization by the express agreement of the copyright holder. Users are welcome to download all or part of a book or volume as PDF, high resolution JPEG2000 or TIFF, or OCR text files. The BHL goes beyond open access, however, to promote open data. All the bibliographic metadata held within its repository, as well as the millions of scientific names identified in the OCR text, are made freely available via data export and web services that allow users to re-purpose and re-mix BHL content for their own use. Intrepid individuals behind projects like BioStor, as well as organizations like the Encyclopedia of Life (EOL) and JSTOR Plant Science, are putting BHL data to work for their own purposes.
So wouldn’t it be great if you could have access to all the taxonomic names for the flora and fauna of the world in one convenient place? Without the restrictions of physical space, the opportunity exists for a digital library to serve as the “one stop”, trusted repository of biodiversity literature. But, then, what exactly is biodiversity literature and how much of it is out there? “Biodiversity” scholarship covers a wide spectrum of disciplines. The U.N. Convention on Biological Diversity describes it as “the variability among living organisms from all sources including…terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part; this includes diversity within species, between species and of ecosystems.” This quote by E.O. Wilson captures the extensive nature of biodiversity research and thus the broad approach that must be taken in amassing a collection that serves a wide, interdisciplinary audience.
At its core, the BHL serves zoologists, botanists, evolutionary biologists, ecologists, natural history collections managers, scientific illustrators, biological science historiographers, and amateur scientists & hobbyists. This representative subset of BHL subject terms, adapted from LCSH, shows core and supporting subject matter. When deciding what to scan from our own collections, we try to maximize our scanning dollars by prioritizing the core literature.
But with no clear-cut boundaries around biodiversity literature, and with the freedom to amass lots of content within a virtual space, where do you draw the line? Collection development for digital libraries is about the “long tail”. As collection boundaries can be more fluid, opportunities exist to acquire new content by harvesting materials from open repositories like the Internet Archive.
With over 2.7 million items in the IA e-book corpus, how do you identify the right content to bring into your collection? The BHL developed a methodology for ingesting content based on LCSH and LC call numbers relevant to the wide variety of disciplines that make up biodiversity research. First, we examined the e-book metadata available in the Internet Archive and decided to proceed with those records we knew best how to handle. We needed a predictable set of fields upon which to match our set of LCSH and call number criteria, so we excluded content that did not have metadata in the form of MARC records. Next, we examined the subject headings and call numbers for existing BHL content and derived a list of the most common LCSH and LC classes, such as QK and QL. We specified which MARC fields to match against and established a list of form and genre headings for exclusion, such as “Fiction” and “Humor”. The criteria needed to match on simple If/Then clauses, and not on compound conditions such as: IF “Anatomy” AND “Birds OR Fish” THEN ingest ELSE reject. Imagine having to design a separate query statement for each LC term based on all its possible variant meanings! This meant matching on the MARC 650 |a only. In November of 2009, the BHL completed its first ingest of IA content, increasing its collection by over 19,000 titles!
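The first-pass criteria described above can be sketched in Python. This is only an illustration, not the BHL's actual ingest code: each record is a plain dictionary standing in for a MARC record, the key names (`has_marc`, `subjects`, `genres`, `call_numbers`) are hypothetical, and the subject list is a made-up sample. Only QK, QL, “Fiction” and “Humor” come from the talk itself.

```python
# Hypothetical criteria lists; the talk names only QK, QL, "Fiction" and "Humor".
SUBJECT_TERMS = {"botany", "birds", "natural history"}
LC_CLASS_PREFIXES = ("QK", "QL")
EXCLUDED_GENRES = {"fiction", "humor"}


def normalize(term):
    """Lowercase a heading and drop the trailing period catalog headings often carry."""
    return term.strip().rstrip(".").lower()


def should_ingest(record):
    """Apply simple IF/THEN rules to one record (a dict standing in for a MARC record)."""
    # No MARC metadata means no predictable fields to match on: reject.
    if not record.get("has_marc", False):
        return False

    # Excluded form/genre headings reject the record (assumed to win over any match).
    genres = {normalize(g) for g in record.get("genres", [])}
    if genres & EXCLUDED_GENRES:
        return False

    # IF any subject heading (650 |a) is on the list THEN ingest.
    subjects = {normalize(s) for s in record.get("subjects", [])}
    if subjects & SUBJECT_TERMS:
        return True

    # Otherwise ingest only if a call number starts with a listed LC class.
    return any(
        cn.strip().upper().startswith(LC_CLASS_PREFIXES)
        for cn in record.get("call_numbers", [])
    )
```

Each rule is a single flat test, which is the point of the approach: there is no per-term query logic, so a broad term on the list lets in everything that carries it.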
Analysis of our newly ingested content showed we were picking up some good things…and some weird things. We noticed irrelevant content coming in, such as books about human hygiene and medicine, as a result of terms like “Physiology” and “Anatomy”. Lessons learned: 1) avoid broad terms, 2) match against specific MARC call number fields, and 3) take a more targeted approach for the ingest of content from libraries whose collections are not necessarily focused on biodiversity subject matter.
As part of the new criteria for ingesting content, we implemented 5 strategies to refine our “bait” and better target our catch. We…
Note on “Evolution”: a book about the evolution of birds would still have the subject term “birds”, and books on the evolution of social movements would be avoided.
In June of 2010, we implemented the new ingest criteria, which run on a weekly basis, bringing in an average of 120 new titles per week.
MARC 650 second indicator values: 0 = LC, 3 = National Ag Lib
Analysis of ingested content is ongoing. Overall, we found that the new criteria produced better results: core subject terms like “natural history” and “botany” were now returned in greater proportion than non-core terms like “agriculture” and “hunting”. As funding for book digitization tapers off, the BHL is looking into other e-book repositories that may provide opportunities for harvesting already-scanned content. Our work to supplement our collection with open access e-book content is ever evolving. As the BHL project expands globally and exchanges content with other data pools, the ingest methodology will prove to be a useful tool in our collection development toolbox going forward.
Fishing for the Right Content in a Sea of Free E-books
Bianca Crowley
Biodiversity Heritage Library Collections Coordinator
American Museum of Natural History (New York)
Academy of Natural Sciences Philadelphia
Botany Libraries, Harvard University
California Academy of Sciences (San Francisco)
Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
Field Museum (Chicago)
Marine Biological Laboratory / Woods Hole Oceanographic Institution
Missouri Botanical Garden (St. Louis)
Natural History Museum (London)
New York Botanical Garden (New York)
Royal Botanic Garden, Kew
Smithsonian Institution Libraries (Washington)
http://biodiversitylibrary.org
BHL Digitization Overview (diagram labels): Selection; Descriptive Metadata; Images, OCR text, XML; Taxonomic Name Services; BHL; User Feedback
Content Selection (diagram labels): BHL Member Library Contributed Titles; Titles Ingested from the Internet Archive
How do you identify the right content in a sea of possibilities?
1) Examine your own kind
2) Survey the waters
3) Cast a wide net
4) Evaluate
One Fish, Two Fish, Red Fish, Fail
Validation of ice skating protocol to predict aerobic power in hockey players / by Nicholas J. Petrella.
“50 fish from American waters” by Allen & Ginter (188-)
A measure of meter conservation in music, based on Piaget's theory / by Mary Louise Serafine.
“The Gardeners' chronicle: a weekly illustrated journal of horticulture and allied subjects” (1874)
“Botany” by Brewer, Watson, & Gray 1st & 2nd ed.
Keeping the body in health, by M. V. O'Shea and J. H. Kellogg
Refining our Bait
1) Removed broad terms, like “Biology” and “Evolution”
2) Identified useful LCSH terms from the titles ingested in November to add to our criteria
3) Identified terms associated with content that we wanted to exclude from our collection, e.g. “Gambling” & “Taxation”
4) Targeted specific MARC fields to match on call numbers; added a match against a predetermined set of Dewey call numbers
5) Matched only against MARC ‘650 |a’s where the second indicator equaled 0 (LC) or 3 (NAL)
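The subject-heading part of the refined criteria on this slide can be sketched in Python as follows. It is a minimal illustration under stated assumptions, not the BHL's production code: each 650 field is a hypothetical dictionary with an `ind2` key (second indicator) and an `a` key (subfield |a), the include list is a made-up sample, and the rule that an excluded term overrides a match is an assumption. Call number and Dewey matching are left out.

```python
# Second indicator values accepted for MARC 650 fields: 0 = LC, 3 = NAL.
ACCEPTED_IND2 = {"0", "3"}

# Hypothetical include list. Broad terms such as "Biology" and "Evolution"
# are simply absent from it, which is how they are "removed".
INCLUDE_TERMS = {"botany", "natural history", "birds"}

# Example exclusion terms named in the talk.
EXCLUDE_TERMS = {"gambling", "taxation"}


def normalize(term):
    """Lowercase a heading and drop the trailing period catalog headings often carry."""
    return term.strip().rstrip(".").lower()


def trusted_subjects(fields_650):
    """Return normalized 650 |a values whose second indicator is 0 (LC) or 3 (NAL)."""
    return {
        normalize(field["a"])
        for field in fields_650
        if field.get("ind2") in ACCEPTED_IND2 and field.get("a")
    }


def should_ingest_refined(fields_650):
    """Reject on any excluded term (assumed to win), else require an included term."""
    subjects = trusted_subjects(fields_650)
    if subjects & EXCLUDE_TERMS:
        return False
    return bool(subjects & INCLUDE_TERMS)


# A book on the evolution of birds still matches through "Birds" ...
bird_book = [{"ind2": "0", "a": "Evolution"}, {"ind2": "0", "a": "Birds"}]
# ... while a book on the evolution of social movements matches nothing.
social_book = [{"ind2": "0", "a": "Evolution"}, {"ind2": "0", "a": "Social movements"}]

print(should_ingest_refined(bird_book))
print(should_ingest_refined(social_book))
```

Filtering on the second indicator before any term matching keeps headings from other thesauri out of the comparison, so the term lists only ever need to follow LC and NAL usage.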