Fishing for the Right Content in a Sea of Free E-books
  • Hello! My name is Bianca Crowley and I am the Collections Coordinator for the Biodiversity Heritage Library. In the next 10 minutes, I will be talking to you about how the BHL incorporates hundreds of e-book materials into its online collection every week and all for free.
  • The Biodiversity Heritage Library (BHL) is a consortium of 12 natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL project launched its web portal in May of 2007. Now a collection of over 46,000 titles, 89,000 volumes and 33.2 million pages, the BHL has garnered a dedicated user base and is expanding globally with partner projects developing in Europe, China, Australia, Egypt and Brazil.
  • Libraries in the BHL consortium work with the Internet Archive to digitize the books from their respective collections. Materials are selected and de-duplicated to the best of our ability, and shipped on carts to the nearest scanning facility. Books are scanned on non-destructive scanners, called Scribes. At the time of scanning, descriptive metadata is passed from the contributing library’s catalog to populate records for the digitized works. Page images, automatically rendered OCR text, and metadata files are generated for each item and stored in the Internet Archive repository. The BHL then harvests the metadata, image, and text files and serves them via the BHL web portal. BHL adds new services on top of the content, such as uBio’s taxonomic name finding service, which algorithmically identifies the scientific names embedded within the OCR text. Finally, user feedback is collected and any requests for scanning inform the selection process.
  • The Internet Archive is an organization “offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format”. They work with other libraries and institutions to provide low-cost digitization and data storage services. The Internet Archive administers the Open Content Alliance, “a collaborative effort of…organizations from around the world [that work together to] build a permanent archive of multilingual digitized text and multimedia material.” IA’s corpus of digitized material is, in large part, completely free and open for the taking, making for a veritable sea of e-books to choose from for integration into your own collection.
  • Likewise, BHL content is completely free to access, download, and re-use. Materials in the collection are either 1) in the public domain (published prior to 1923), 2) identified as texts without copyright renewal, or 3) permitted for digitization by the express agreement of the copyright holder. Users are welcome to download all or part of a book or volume as PDF, high resolution JPEG2000 or TIFF, or OCR text files. The BHL goes beyond open access, however, to promote open data. All the bibliographic metadata held within its repository, as well as the millions of scientific names identified in the OCR text, are made freely available via data export and web services that allow users to re-purpose and re-mix BHL content for their own use. Intrepid individuals behind projects like BioStor, as well as organizations like the Encyclopedia of Life (EOL) and JSTOR Plant Science, are putting BHL data to work for their own purposes.
  • So wouldn’t it be great if you could have access to all the taxonomic names for the flora and fauna of the world in one convenient place? Without the restrictions of physical space, the opportunity exists for a digital library to serve as the “one stop”, trusted repository of biodiversity literature. But, then, what exactly is biodiversity literature and how much of it is out there? “Biodiversity” scholarship covers a wide spectrum of disciplines. The U.N. Convention on Biological Diversity describes it as “the variability among living organisms from all sources including…terrestrial, marine and other aquatic ecosystems and the ecological complexes of which they are part; this includes diversity within species, between species and of ecosystems.” The quote by E.O. Wilson on this slide captures the extensive nature of biodiversity research and thus the broad approach that must be taken in amassing a collection that serves a wide, interdisciplinary audience.
  • At its core, the BHL serves zoologists, botanists, evolutionary biologists, ecologists, natural history collections managers, scientific illustrators, biological science historiographers, and amateur scientists & hobbyists. This representative subset of BHL subject terms, adapted from LCSH, shows core and supporting subject matter. When deciding what to scan from our own collections, we try to maximize our scanning dollars by prioritizing the core literature.
  • But without clear-cut boundaries around biodiversity literature and the freedom to amass lots of content within a virtual space, where do you draw the line? Collection development for digital libraries is about the “long tail”. As collection boundaries can be more fluid, opportunities exist to acquire new content by harvesting materials from open repositories like the Internet Archive.
  • With over 2.7 million items in the IA e-book corpus, how do you identify the right content to bring into your collection? The BHL developed a methodology for ingesting content based on LCSH and LC call numbers relevant to the wide variety of disciplines that make up biodiversity research. First, we examined the e-book metadata available in the Internet Archive and decided to proceed with those records we knew best how to handle. We needed a predictable set of fields upon which to match our set of LCSH and call number criteria, so we excluded content that did not have metadata in the form of MARC records. Next, we examined the subject headings and call numbers for existing BHL content and derived a list of the most common LCSH and LC classes, such as QK and QL. We specified which MARC fields to match against and established a list of form and genre headings for exclusion, such as “Fiction” and “Humor”. The criteria needed to match on simple If/Then clauses, and not on compound ones like: IF “Anatomy” AND “Birds OR Fish” THEN ingest ELSE reject. Imagine having to design a separate query statement for each LC term based on all its possible variant meanings! This meant matching on the MARC 650 subfield a only. In November of 2009, the BHL completed its first ingest of IA content, increasing its collection by over 19,000 titles!
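The first-pass criteria described above can be sketched roughly as follows. This is a minimal illustration, not the BHL's actual ingest code: the `Record` class is a hypothetical, simplified stand-in for a parsed MARC record, and the term lists are illustrative apart from the classes (QK, QL) and genre headings (“Fiction”, “Humor”) named in the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """Hypothetical, simplified view of a MARC record (not a real MARC parser)."""
    title: str
    subjects_650a: List[str] = field(default_factory=list)  # 650 subfield a values
    genres: List[str] = field(default_factory=list)         # form/genre headings
    lc_call_number: str = ""


# Illustrative criteria only; the real BHL lists are not given in the talk.
INCLUDE_SUBJECTS = {"botany", "birds", "fishes", "anatomy"}
INCLUDE_LC_CLASSES = ("QK", "QL")       # LC classes named in the talk
EXCLUDE_GENRES = {"fiction", "humor"}   # exclusion headings named in the talk


def normalize(heading: str) -> str:
    """Lower-case a heading and drop surrounding space and a trailing period."""
    return heading.strip().rstrip(".").lower()


def first_pass_ingest(record: Record) -> bool:
    """Simple If/Then matching: one term or one class is enough to ingest."""
    if any(normalize(g) in EXCLUDE_GENRES for g in record.genres):
        return False
    if any(normalize(s) in INCLUDE_SUBJECTS for s in record.subjects_650a):
        return True
    return record.lc_call_number.strip().upper().startswith(INCLUDE_LC_CLASSES)
```

Because each term is matched on its own, a broad heading such as “Anatomy” admits any record that carries it, whatever the record is actually about.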
  • Analysis of our newly ingested content showed we were picking up some good things…and some weird things. We noticed irrelevant content coming in, such as books about human hygiene and medicine, as a result of terms like “Physiology” and “Anatomy”. Lessons learned: 1) avoid broad terms, 2) match against specific MARC call number fields, and 3) take a more targeted approach for the ingest of content from libraries whose collections are not necessarily focused on biodiversity subject matter.
  • As part of the new criteria for ingesting content, we implemented 5 strategies to refine our “bait” and better target our catch. We… On removing a broad term like “Evolution”: a book about the evolution of birds would still have the subject term “Birds”, while books on the evolution of social movements would be avoided. On the MARC 650 second indicator: 0 = LC, 3 = National Agricultural Library. In June of 2010, we implemented the new ingest criteria, which run on a weekly basis, bringing in an average of 120 new titles per week.
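A rough sketch of the refined subject matching might look like this. It assumes a hypothetical `SubjectField` structure for a MARC 650 field. The term lists are illustrative, except for the exclusion terms “Gambling” and “Taxation” and the indicator values 0 and 3, which come from the slides. Call number and Dewey matching are left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class SubjectField:
    """Hypothetical, simplified MARC 650 field: second indicator plus subfield a."""
    indicator2: str
    a: str


ACCEPTED_SOURCES = {"0", "3"}             # 0 = LCSH, 3 = National Agricultural Library
EXCLUDE_TERMS = {"gambling", "taxation"}  # exclusion terms named on the slide
# Illustrative include list; broad terms such as "biology" and "evolution"
# are deliberately absent, as the refined criteria removed them.
INCLUDE_TERMS = {"birds", "botany", "natural history"}


def trusted_terms(fields: Iterable[SubjectField]) -> Set[str]:
    """Collect normalized 650 subfield a values from accepted thesauri only."""
    return {
        f.a.strip().rstrip(".").lower()
        for f in fields
        if f.indicator2 in ACCEPTED_SOURCES
    }


def refined_ingest(fields: Iterable[SubjectField]) -> bool:
    """Reject on any exclusion term, then require at least one specific include term."""
    terms = trusted_terms(fields)
    if terms & EXCLUDE_TERMS:
        return False
    return bool(terms & INCLUDE_TERMS)
```

With these lists, a record headed “Birds” and “Evolution” is still ingested through “Birds”, while one headed “Social movements” and “Evolution” matches nothing and is skipped.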
  • Analysis of ingested content is ongoing. Overall, we found that the new criteria produced better results where core subject terms like “natural history” and “botany” were now returned in greater proportion than non-core terms like “agriculture” and “hunting”. As funding for book digitization tapers off, the BHL is looking into other e-book repositories that may provide potential opportunities for harvesting already-scanned content. Our work to supplement our collection with open access e-book content is ever evolving. As the BHL project expands globally and exchanges content with other data pools, the ingest methodology will prove to be a useful tool in our collection development tool box going forward.
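One simple way to track the proportion of core to non-core terms in an ingest batch is sketched below. The function name and input shape are assumptions made for illustration. The two term sets contain only the examples mentioned in the talk, not the BHL's full lists.

```python
from collections import Counter
from typing import Dict, Iterable, List

CORE_TERMS = {"natural history", "botany"}   # core examples from the talk
NON_CORE_TERMS = {"agriculture", "hunting"}  # non-core examples from the talk


def core_share(ingested_subjects: Iterable[List[str]]) -> Dict[str, float]:
    """Return the share of core vs. non-core headings across ingested titles.

    Each element of `ingested_subjects` is the list of subject headings for
    one title. Headings in neither set are ignored.
    """
    counts: Counter = Counter()
    for subjects in ingested_subjects:
        for heading in subjects:
            term = heading.strip().rstrip(".").lower()
            if term in CORE_TERMS:
                counts["core"] += 1
            elif term in NON_CORE_TERMS:
                counts["non_core"] += 1
    total = counts["core"] + counts["non_core"]
    if total == 0:
        return {"core": 0.0, "non_core": 0.0}
    return {"core": counts["core"] / total, "non_core": counts["non_core"] / total}
```

Comparing this share between the November 2009 batch and later weekly batches would give a quick signal of whether the refined criteria are pulling in more core material.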
  • Transcript

    • 1. Fishing for the Right Content in a Sea of Free E-books
      Bianca Crowley
      Biodiversity Heritage Library Collections Coordinator
    • 2. American Museum of Natural History (New York)
      Academy of Natural Sciences Philadelphia
      Botany Libraries, Harvard University
      California Academy of Sciences (San Francisco)
      Ernst Mayr Library of the Museum of Comparative Zoology, Harvard University
      Field Museum (Chicago)
      Marine Biological Laboratory / Woods Hole Oceanographic Institution
      Missouri Botanical Garden (St. Louis)
      Natural History Museum (London)
      New York Botanical Garden (New York)
      Royal Botanic Garden, Kew
      Smithsonian Institution Libraries (Washington)
      http://biodiversitylibrary.org
    • 3. BHL Digitization Overview
      SELECTION
      Descriptive Metadata
      Images, OCR text, XML
      Taxonomic Name Services
      BHL
      User Feedback
    • 4.
    • 5. Open Access, Open Data
      All materials in the BHL collection are open access:
      Public domain content
      Texts without © renewal
      Permission granted from the © holder
      Download and re-use of metadata and content files encouraged!
      Data export and web services promote the use of BHL data in conjunction with other projects like the Encyclopedia of Life, JSTOR Plant Science and BioStor.
    • 6. “Biologists are inclined to agree
      that [biodiversity] is, in one sense, everything.”
      – E. O. Wilson
    • 7. Aquaculture
      Animal behavior
      Animal culture
      Animal husbandry
      Agricultural ecology
      Bioacoustics
      Collection & preservation
      Biochemistry
      Biomechanics
      Bioclimatology
      Bioluminescence
      Biogeomorphology
      Coral islands, reefs, & atolls
      Continental drift
      Conservation biology
      Economic botany
      Ecology
      Ecophysiology
      Embryology
      Ecosystems
      Anatomy
      Amphibia
      Algae
      Forestry
      Evolutionary genetics
      Angiosperms
      Arthropoda
      Arachnida
      Genetics
      Geomicrobiology
      Atlases & gazetteers
      Biodiversity conservation
      Horticulture
      Geobiology
      Botany
      Bryology
      Biological diversity
      Classification & nomenclature
      Cyanobacteria
      Extinction
      Evolution
      Endangered species
      Entomology
      Ferns & allies
      Fungi
      Gymnosperms
      Geographical distribution
      Ichthyology
      History of natural sciences
      Linnaean works
      Invertebrates
      Medical botany
      Morphology
      Mollusca
      Mammalia
      Marine biology
      Natural history biographies
      Natural history dictionaries & encyclopedias
      Paleobotany
      Ornithology
      Paleozoology
      Phylogenetic relationships
      Plant anatomy
      Porifera
      Primatology
      Pre-Linnaean works
      Reproduction
      Reptilia
      Protozoa
      Scientific illustration
      Specimen catalogs
      Taxonomy
      Systematics
      Microbial ecology
      Zoology
      Scientific expeditions
      Natural history directories
      Natural history bibliographies
      Natural history terminology
      Oceanography
      Physical anthropology
      Phenology
      Plant conservation
      Plant lore
      Radioecology
      Plant culture
      Plate tectonics
      Plant ecology
      Plant physiology
      Stratigraphy
      Taxidermy
      Restoration ecology
      Soil ecology
      Virology
      Wild animal trade
      Vivariums, terrariums, & aquariums
      Zoos
      Wildlife conservation
    • 8. BHL Member Library Contributed Titles
      Content Selection
      Titles Ingested from the Internet Archive
    • 9. How do you identify the right content in a sea of possibilities?
      Examine your own kind
      Survey the waters
      Cast a wide net
      Evaluate
    • 10. One Fish, Two Fish, Red Fish, Fail
      Validation of ice skating protocol to predict aerobic power in hockey players / by Nicholas J. Petrella.
      “50 fish from American waters” by Allen & Ginter (188-)
      A measure of meter conservation in music, based on Piaget's theory / by Mary Louise Serafine.
      “The Gardeners' chronicle :a weekly illustrated journal of horticulture and allied subjects” (1874)
      “Botany” by Brewer, Watson, & Gray 1st & 2nd ed.
      Keeping the body in health, by M. V. O'Shea and J. H. Kellogg
    • 11. Refining our Bait
      Removed broad terms, like “Biology” and “Evolution”
      Identified useful LCSH terms from the titles ingested in November to add to our criteria
      Identified terms associated with content that we wanted to exclude from our collection, e.g. “Gambling” & “Taxation”
      Targeted specific MARC fields to match on call numbers; added a match against predetermined set of Dewey call nos.
      Matched only against MARC ‘650 |a’s where second indicator equaled 0 (LC) or 3 (NAL)
    • 12. Conclusion
    • 13. Thanks!
      http://biodiversitylibrary.org
      Bianca Crowley
      crowleyb@si.edu
      Slides: slideshare.net/lipscombb
      BioDivLibrary
      Biodiversity Heritage Library
      Special thanks to Martin Kalfatovic & Suzanne Pilsk, Smithsonian Institution Libraries