Collection Assessment in a Collaborative
Environment: BHL
Connie Rinaldo, Bianca Crowley, Trish
Rose Sandler & William Ula...
The BHL is…
• A consortium of 15 natural history, botanical libraries and
research institutions
• An open access, full-tex...
BHL Goals
• Goal 1: Relevant Content: Build & maintain the BHL as the largest reliable,
reputable, & responsive repository...
Core BHL Member Institutions
Global Partners
http://biodiversitylibrary.org

Now online
64,188 titles

120,461 volumes
42 million+ pages
BHL Overview
• New user interface launched in March
• Search by title, author, article, subjects and scientific
names
• Va...
Core Principles
• Open access
• Open data
• Deconstruct the silo and deliver content where users are
already working
– Via...
Scanning Locally, Coordinating Globally

[Diagram: Vols. 1-5; Vols. 6, 8, 10; Vols. 7, 9, 11-21; linked through Issue Tracking Software]
Beyond the Silo: Open Data
• Stable URLs
• Open Data Policy
• APIs (Application Programming Interfaces)
• Data Exports
• OAI-PMH O...
User Feedback is Critical
General feedback form

http://biodiversitylibrary.org/contact

Scan request form
Impact
•

“BHL came to the rescue when a planned trip to work in the Mertz Library at The New
York Botanical Garden had to...
Biodiversity
Literature

BHL

EOL

Scientific Names

Researchers

Publications

Datasets
Collecting
Events Specimens
Local...
Questions about BHL Content
• How many books in BHL are there about....?
• How can we identify areas of weakness in BHL
in...
Questions about BHL Content
• What are scalable solutions to content
analysis?
• Can we provide creative & meaningful
visu...
Why do we care about taxonomic
names?
• Scientists use taxonomic names to organize
their research
• Biodiversity literatur...
Extracted Scientific Names
What is “Taxonomic Intelligence”?
• Global Names Recognition & Discovery tool
– Locate, verify, record scientific names fr...
Overview of available BHL (meta) data
http://biodivlib.wikispaces.com/Data+Exports
• Title metadata: contributed from MARC...
Data Exports
Visualization of BHL Data for Pinus banksiana
Source Data Sample
Sample BHL & Nomenclatural Data
• Google Refine reconciled list of BHL subject keywords
• List of vetted BHL subject targe...
Other Tools & Process
• Bibliographies (discipline & more)
• Index Animalium: identifies first appearance of 400,000 animals...
SAMPLE VISUALIZATIONS
Core & Supporting Keywords for BHL Collections
Wordle for BHL Content
http://public.tableausoftware.com/views/BHLViz/DigitizedSubjects
Visualization Opportunities
• JournalMap (geo tagging scientific
literature) http://www.journalmap.org/
• Visualizing arti...
Taxon Data Manipulation
Opportunities
• Euler Project: Reasoning with Taxonomies:
http://euler.cs.ucdavis.edu/
• REST & Ta...
SUMMARY
• Metadata reconciliation
• Gap analysis
• Visualizations
• All automated!
Thank you for your
Help!
http://biodiversitylibrary.org
Connie Rinaldo
crinaldo@oeb.harvard.edu
Collection assessment in a collaborative environment: Biodiversity Heritage Library
Published in: Education
  • GOALS:
  • A free & open access digital library for biodiversity literature and primary source materials (field books). A consortium of 15 libraries working together to run a virtual library branch. A collection of content from the 15-member BHL consortium and other Internet Archive contributors. Anyone is free to access & download BHL materials.
  • SEARCH: The "subjects" tab of the advanced search (http://biodiversitylibrary.org/advsearch) searches the table of subject keywords we have in BHL, derived from LCSH. It does NOT search titles or scientific names. If you do a basic keyword search from the homepage for a subject term, say "Birds", you will pull hits across all titles, articles, authors, subjects and scientific names, broken out by tabs. Notice that the subjects tab shows all search results where "birds" is part of the subject keyword string, such as "Birds of prey" or "Cage birds".
  • COLLABORATION!
  • Add images…Also add DOIs?
  • User feedback is key; we rely on the many eyes of the crowd to help us direct our curation activities to the content people are actually using. Users can let us know if they find a problem with something in our collection through our general feedback form, and can request that something be scanned through our scanning request form.
  • The trees of North America, entomology, or bears: metadata, right? BUT LCSH doesn't adequately describe the biodiversity literature. Scientists organize around scientific names, articles, and parts of articles (species descriptions).
  • From Rod Page: he constructed a table listing all the journals in BioNames that have an ISSN, ordered by the number of articles in BioNames (i.e., mostly articles that publish new names). The full table is on his blog; the part reproduced here is limited to journals with at least 500 articles in BioNames.
  • The Biodiversity Heritage Library uses taxonomic intelligence tools, including Global Names Recognition and Discovery (GNRD) developed by Global Names Architecture, to locate, verify, and record scientific names located within the text of each digitized page. Note: the text used for this identification is uncorrected OCR, so it may not include all results expected or visible in the page. This names-based index is an incredibly valuable tool for organismal research, and is easily incorporated into external web sites through two different methods of access.
  • Bold = focus for this session: what we have provided on Library Box. Names are extracted from OCR text. Each dataset has its own complexity. Taxonomic names (a) have a hierarchy (the next-to-last one is an infraspecific taxonomic level: forma), (b) change over time (the 4th one in the list, Pinus divaricata, is a synonym), and (c) have all sorts of exceptions to the rules (the last one, Pinus × murraybanksiana, is a hybrid). Common names (a) are subjective and biased towards organisms of well-known groups only, and (b) are dependent on language, region and time. Subjects are (a) language dependent, (b) hierarchical, and (c) at title level.
  • These have all been provided on Library Box, in addition to some more specific sets. We also have MODS, EndNote and BibTeX files for titles, items/volumes and parts.
  • A visualization of BHL data for Pinus banksiana (as it appears on page 140 of v. 78 of The Canadian Field-Naturalist). How do we reconcile all of this to find out what content covers our question? How can we map the more specific terms to LCSH/call numbers when we have limited resources? We need to automate as much as possible. We want consistent language. The BHL uses LC for the volumes but also pulls out scientific names. How do we get them incorporated into the consistent language of LC in an automated way that can scale? We want to know what we have so we can compare to an (as yet) unidentified universe (bibliographies, Index Animalium, TL-2).
  • To show that name data come from multiple sources
  • BOLD means in Library Box. Google Refine: what they are and implications for collection analysis. These are links to
  • Index Animalium, TL-2: Literature breaks down by discipline and even by specific taxon. Scientific names and bibliographic structure are different and we are trying to merge the two: we look at scientific data next to library data but have to make sense of the merger in the library world (see coll dev chart). Scientists work at an article/name/article-part level; we work at the level of the volume. Taxonomic Literature: A selective guide to botanical publications and collections with dates, commentaries and types (Stafleu et al.). TL-2, the premier publication of the International Association for Plant Taxonomy (IAPT), is a 15-volume guide to the literature of systematic botany published between 1753 and 1940. It is organized by author and includes numbered entries for the author's publications. Index Animalium is Sherborn's life's work: a 9,000-page bibliography identifying the first book in which over 400,000 organisms appeared; it covers 1758-1850. All of this is a LENGTHY process! It needs more automation. Zoological Record is the world's oldest continuing database of animal biology. It is considered the world's leading taxonomic reference and, with coverage back to 1864, has long acted as the world's unofficial register of animal names. Early on we compared the universe of what is in the big libraries to what was in BHL, and that allowed us to fill gaps: https://bhl.wikispaces.com/BHL+Priority+Titles
  • These are keywords that we use to describe how we collect for BHL. These are adapted from LC but not necessarily actual subject heading. We modified some terms to make the language clear and bring in some of the scientific naming conventions (Ornithology instead of birds). This was meant to merge appropriate parts of the library and scientific world. This is the consistent language against which we want to compare BHL content.
  • Many irrelevant features; breaks up phrases (united states). At least it shows that we have lots of BOTANY (but we would want to merge that with plants).
  • This shows the distribution of keywords for items scanned by the Ernst Mayr Library of the Museum of Comparative Zoology (good thing zoology shows up as a big piece). This was made using Tableau software; all of the tiny items can be identified but, like Wordle, it includes lots of irrelevant stuff. How can we automate the improvement and appropriate merging of metadata? http://public.tableausoftware.com/views/BHLViz/DigitizedSubjects
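The two word-cloud problems noted above (multi-word phrases such as "united states" being split, and near-equivalent terms such as Botany and Plants being counted separately) can be reduced by counting whole subject strings and mapping variants to a preferred keyword before counting. The following is only a minimal sketch of that idea: the `PREFERRED` mapping and the sample subjects are invented for illustration and are not BHL data or the BHL collection-policy vocabulary.

```python
from collections import Counter

# Hypothetical mapping from variant subject terms to a preferred keyword
# (assumption for illustration; not the actual BHL keyword list).
PREFERRED = {
    "plants": "botany",
    "birds": "ornithology",
}


def normalize(subject: str) -> str:
    """Lower-case, collapse whitespace, drop a trailing period, map variants."""
    key = " ".join(subject.lower().split()).rstrip(".")
    return PREFERRED.get(key, key)


def keyword_counts(subjects):
    """Count whole subject strings so multi-word phrases stay intact."""
    return Counter(normalize(s) for s in subjects if s.strip())


if __name__ == "__main__":
    # Made-up sample subject strings.
    sample = ["Botany", "Plants", "United States", "Birds",
              "united  states", "Zoology."]
    for term, n in keyword_counts(sample).most_common():
        print(f"{term}\t{n}")
```

Counting per subject string rather than per word is what keeps "united states" together; the mapping step is where a vetted keyword list would plug in.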

    1. 1. Collection Assessment in a Collaborative Environment: BHL Connie Rinaldo, Bianca Crowley, Trish Rose Sandler & William Ulate
    2. 2. The BHL is… • A consortium of 15 natural history, botanical libraries and research institutions • An open access, full-text digital library for legacy biodiversity literature. • An open data repository of taxonomic names and bibliographic information • An expanding global effort • Mission: The Biodiversity Heritage Library improves & makes more efficient the methodology of research in biodiversity studies by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community.
    3. 3. BHL Goals • Goal 1: Relevant Content: Build & maintain the BHL as the largest reliable, reputable, & responsive repository of biodiversity literature & archival materials. • Goal 2: Tools & Services: Develop services & tools which facilitate discovery & improve research efficiency of BHL content. • Goal 3: User Engagement: Increase global awareness about the BHL through outreach, learning & education, & branding through engagement & collaboration with existing & new user communities. • Goal 4: Membership & Partnerships: Grow BHL consortia membership & partnerships while fostering cross-institutional collaboration that continues to serve as a model for digital library development • Goal 5: Financial Sustainability: Ensure sustainability & relevance by being flexible, adaptable, & financially sound while the content & services remain openly & freely available.
    4. 4. Core BHL Member Institutions
    5. 5. Global Partners
    6. 6. http://biodiversitylibrary.org Now online 64,188 titles 120,461 volumes 42 million+ pages
    7. 7. BHL Overview • New user interface launched in March • Search by title, author, article, subjects and scientific names • Various download options, including high resolution • Taxonomic name finding algorithm • Machine-to-machine services • Full-text search being tested
    8. 8. Core Principles • Open access • Open data • Deconstruct the silo and deliver content where users are already working – Via other biodiversity websites and taxonomic resources – Via social media platforms: blog, Flickr, Facebook, Twitter, Pinterest, etc. • Involve users in collection and technical development activities
    9. 9. Scanning Locally, Coordinating Globally Vols. 6, 8, 10 Issue Tracking Software Vols. 1-5 Vols. 7, 9, 11-21
    10. 10. Beyond the Silo: Open Data • Stable URLs • Open Data Policy • APIs (Application Programming Interfaces) • Data Exports • OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
    11. 11. User Feedback is Critical General feedback form http://biodiversitylibrary.org/contact Scan request form
    12. 12. Impact • “BHL came to the rescue when a planned trip to work in the Mertz Library at The New York Botanical Garden had to be cancelled due to Hurricane Sandy. Thanks to the online resources available through BHL I was able to source most of the key works I needed, with their supporting bibliographic information. Further use of BHL occurred when building work at the Linnean Society of London limited access to some of the book I had been able to use from that collection." • “I would like thank you all very much for invaluable work and support you do. I just got a pdf-file from more than century old (1893) journal paper (regional naturalist society paper, published in Finland), to get copy I should take 500 mile drive to our university library. Now I am got it fastly in high-quality pdf-copy. Cordial thanks and all success in continuing your highly valuable mission.” [conservation biologist from Estonia] • “You are a wonderful resource. I maintain a Website that describes the plant genus Opuntia (prickly pear cacti). There is no way I could maintain such a site without access to literature from 100-200 years ago. Most of the cactus species were discovered long ago; I find it invaluable to put up PDF files to document each species in the literature as I document them photographically. I am a botanist, but I work in the pharmaceutical field (not so many botanical jobs out there). Your library makes it possible for me to continue working with plants in a meaningful and scientific manner.”
    13. 13. Biodiversity Literature BHL EOL Scientific Names Researchers Publications Datasets Collecting Events Specimens Localities Field Notes Phylogenies Nomenclators Name Species Checklists Indexes Content Aggregators
    14. 14. Questions about BHL Content • How many books in BHL are there about....? • How can we identify areas of weakness in BHL in order to prioritize what materials to scan next? • Rod Page has one suggestion: http://iphylo.blogspot.com/2013/10/whichtaxonomic-journals-should-be.html
    15. 15. Questions about BHL Content • What are scalable solutions to content analysis? • Can we provide creative & meaningful visualizations?
    16. 16. Why do we care about taxonomic names? • Scientists use taxonomic names to organize their research • Biodiversity literature breaks down by discipline & by specific taxon
    17. 17. Extracted Scientific Names
    18. 18. What is “Taxonomic Intelligence”? • Global Names Recognition & Discovery tool – Locate, verify, record scientific names from each page – Text is uncorrected OCR
    19. 19. Overview of available BHL (meta) data http://biodivlib.wikispaces.com/Data+Exports • Title metadata: contributed from MARC records of hundreds of library catalogs (BHL consortium libraries & non-BHL IA contributors) • Volume/item metadata: provides information about the actual objects & pieces digitized • Subject • Creator/author data • Segment/part/”article” metadata (separate table for segment/part creators?) • Page metadata which includes our algorithmically identified scientific name data • OCR text available at the item/volume level but not overall for corpus of BHL
    20. 20. Data Exports
    21. 21. Visualization of BHL Data for Pinus banksiana
    22. 22. Source Data Sample
    23. 23. Sample BHL & Nomenclatural Data • Google Refine reconciled list of BHL subject keywords • List of vetted BHL subject targets from collection development policy • Taxonomic name data set for trees of North America (link out) • http://www.fs.fed.us/database/feis/plants/tree/index.html • http://www.treesofnorthamerica.net/ • Subject terms associated with BHL titles where Pinus banksiana occurs
    24. 24. Other Tools & Process • Bibliographies (discipline & more) • Index Animalium: identifies first appearance of 400,000 animals from 1758-1850 • Researcher supplied specific taxon bibliographies • Zoological Record: Taxonomic references back to 1864. • Taxonomic Literature II: a selective guide to botanical publications with dates, commentaries and scientific types • Compare universe of biodiversity literature to BHL • Unknown dataset for full universe • Compared BHL member collections to BHL content for gap-filling before content expanded (lists automated but gap identification manual) • REST especies: a way to collate species metadata? http://dopaservices.jrc.ec.europa.eu/services/especies/ • DOPA Explorer http://ehabitat-wps.jrc.ec.europa.eu/dopasimple/
    25. 25. SAMPLE VISUALIZATIONS
    26. 26. Core & Supporting Keywords for BHL Collections
    27. 27. Wordle for BHL Content
    28. 28. http://public.tableausoftware.com/views/BHLViz/DigitizedSubjects
    29. 29. Visualization Opportunities • JournalMap (geo tagging scientific literature) http://www.journalmap.org/ • Visualizing article performance http://bit.ly/1c4TJfn • Better Life Index http://www.oecd.org/statistics/datalab/bli.htm • Altmetric: http://www.altmetric.com/ • Tableau http://www.tableausoftware.com/public/ • Worth it: http://www.wired.com/wiredscience/2013/11/wireddata-life-martin-krzywinski/?viewall=true
    30. 30. Taxon Data Manipulation Opportunities • Euler Project: Reasoning with Taxonomies: http://euler.cs.ucdavis.edu/ • REST & Taxonomy: https://drupal.org/project/taxonomy_api
    31. 31. SUMMARY • Metadata reconciliation • Gap analysis • Visualizations • All automated!
    32. 32. Thank you for your Help! http://biodiversitylibrary.org Connie Rinaldo crinaldo@oeb.harvard.edu
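
The gap-analysis step named in the summary compares a target list (for example, a bibliography or a priority-title list) against what is already held. A minimal sketch of one way to automate that comparison follows; it assumes matching on a crude normalized title key, and the function names and sample titles are hypothetical. A real comparison would probably need stronger matching, such as shared identifiers, because titles vary between catalogs.

```python
import re
import unicodedata


def title_key(title: str) -> str:
    """Reduce a title to a crude match key: ASCII, lower-case, alphanumerics."""
    folded = (unicodedata.normalize("NFKD", title)
              .encode("ascii", "ignore")
              .decode("ascii"))
    words = re.findall(r"[a-z0-9]+", folded.lower())
    # Drop a leading article so "The Canadian ..." matches "Canadian ...".
    if words and words[0] in {"the", "a", "an"}:
        words = words[1:]
    return " ".join(words)


def find_gaps(wanted_titles, held_titles):
    """Return wanted titles whose match key is absent from the held titles."""
    held_keys = {title_key(t) for t in held_titles}
    return [t for t in wanted_titles if title_key(t) not in held_keys]


if __name__ == "__main__":
    # Made-up lists for illustration only.
    wanted = ["The Canadian Field-Naturalist", "Index Animalium",
              "Zoological Record"]
    held = ["Canadian field naturalist", "Index animalium."]
    for title in find_gaps(wanted, held):
        print("Not yet held:", title)
```

The output of `find_gaps` is a candidate list only; as the notes say about the earlier member-collection comparison, identifying true gaps still took manual review.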
