Finding a goldmine of natural history illustrations within BHL texts:  the Art of Life project
 


The Biodiversity Heritage Library (BHL) has now achieved a critical mass of digitized historic texts – over 41 million pages and counting. The BHL portal can be searched by several access points including title, author, subject, and scientific name. What remains largely hidden and entirely unsearchable, however, are the millions of natural history illustrations found within the BHL books and journals. These visual resources, which include drawings, paintings, photographs, maps, and diagrams, represent work by some of the finest botanical and zoological illustrators in the world, including John James Audubon, Georg Dionysius Ehret, and Pierre Redouté. Many of the illustrations are the first recorded descriptions of much of the world’s biota, providing the scientific foundation for contemporary taxonomic research and conservation assessments. Some of them are the only verifiable resource about an organism and its existence on Earth, given changes in global climate patterns and the rapid loss of natural habitat for many species. Audiences for these illustrations cross a variety of disciplines and include biologists, artists, historians, illustrators, graphic designers, archivists, educators, students, and citizen scientists.

In 2012, the Missouri Botanical Garden was awarded a grant from the National Endowment for the Humanities to support a project called The Art of Life: Data Mining and Crowdsourcing the Identification and Description of Natural History Illustrations from the Biodiversity Heritage Library (BHL). This talk will discuss the Art of Life objectives and current status. It will go into detail about the algorithms designed for finding which pages contain illustrations and the schema designed for describing the resulting output. Finally, the talk will discuss the project’s benefits for the scientific community, such as improved access to a significant collection of public domain images related to biodiversity.

Usage Rights

CC Attribution-ShareAlike License

  • Art of Life evolved out of a need in the BHL that was expressed by our users. We had a critical mass of textual content online, and BHL users knew there were amazing images within the BHL pages, but there was no easy way to find them other than opening a BHL book or volume and scrolling through it page by page. There is no descriptive metadata attached to the illustrations that would tell you the content of an image, the date when it was created, or who was involved in its creation. We also wanted to expand BHL to new audiences and domains and felt the illustrations were the pathway for doing that. We knew these illustrations would be of interest not only to biologists but also to artists, historians of both the arts and science, educators, and librarians/curators, so we wrote a proposal to the National Endowment for the Humanities because we believed they would understand and want to support the cross-disciplinary nature of this content. Luckily they did, and they awarded Missouri Botanical Garden a grant for the Art of Life.
  • One way we’ve tried to address the need for image discovery is by pushing selected images to Flickr. We have created a BHL account in Flickr and pushed over 80,000 images so far, but this is a very manual process that takes considerable staff time. We estimate that we have millions of illustrations within BHL, so this manual process does not scale well. The address is flickr.com/photos/biodivlibrary
  • This is the Art of Life workflow diagram, which identifies the 4 processes the illustrations will go through as they move through the workflow: Extract, Classify, Describe, and Share. The Extract stage is where BHL pages will be run through the algorithms to identify which pages contain illustrations, whether they be full plates or only a section of the page. At the Classify stage, the pages with illustrations will be tagged by Art of Life staff as being one or several broad types such as drawing/painting, photograph, diagram, or map. For the Describe stage, the illustrations will be pushed into platforms such as Flickr and Wikimedia Commons, where both the general public and specialists can describe them in much greater detail by adding a title, creator, date (if different from the date of publication), and subjects. Wikimedia Commons is where the schema can play a role. Because Wikimedia allows you to create templates, we can provide guidance to more expert taggers on what information to record and how to record it. In the Share stage, the metadata contributed in Flickr and Wikimedia Commons will be ingested into the BHL portal both for preservation and discovery. Because many of these new audiences don’t know about BHL and wouldn’t go to the BHL platform to discover the illustrations, we also want to push the illustrations out to environments those audiences are familiar with: Encyclopedia of Life, ARTstor, and even iTunesU, where we already have some themed collections at the book level.
  • The team developed a gold standard set to be used as a “control group” to compare results against. This was a set of 100 books and journals whose illustrations were manually tagged with “has illustration”. Accuracy rates are being computed based on how well each algorithm performs against the gold standard set. ABBYY: relies on metadata output from the OCR process, which has coordinate information that is not always accurate. Contrast: pretty accurate, because the contrast qualities of pages with text and those with images are easily distinguishable. Color: pretty much useless, probably due to many of the older texts exhibiting yellowing or poor color quality. Compression: not useful enough to be used. We decided to go with ABBYY and Contrast.
  • This is the interface that IMA built for us to review the performance of the algorithms. The information on top shows the total pages in a book, the actual number of illustrations (based on the gold standard set), and the accuracy rating for ABBYY and Contrast. The information on the upper right allows you to filter by true positives, true negatives, false positives, and false negatives. Each page image is then shown with its bounding coordinates and overall coverage. This allows us to play around with the coverage percentage so we can determine whether pages with 10% coverage are really illustrations or mostly anomalies like illustrated letters or artifacts on the page.
  • The Classification tool that staff will use to identify which broad type of illustration each page contains was developed by Joel Richard of the Smithsonian Libraries. He modified an existing tool called Macaw that BHL currently uses to add volume- and page-level metadata to its books. It’s sort of a light table view of all the pages in a book that allows you to quickly highlight several images and globally assign a type such as map or drawing.
  • A challenge for this project was to identify the schema, or perhaps schemas, that can serve the metadata needs of a mix of audiences. For example, an art historian reviewing an illustration may be interested in knowing the artist and the geographic location where the work was created in order to understand how the artist was influenced by his or her locality. A scientist considering the same illustration may be interested in knowing the species name and geographic distribution of the organism depicted in order to compare the development of the species with related species from that area. Both need the geographic metadata contained within the text, but from different perspectives. Since we wanted to push these illustrations out into other platforms for crowdsourcing the descriptions and then bring that metadata back into the BHL platform, we needed a schema that would guide users in what information to contribute and how to record it, and that would create some consistency in those descriptions so they are easier to bring back to BHL. Rather than inventing a new schema from scratch, we wanted to adopt an existing schema or schemas so that when we shared the described illustrations beyond the BHL, the metadata could easily interoperate with data in other systems.
  • VRA Core was designed for artworks and the images that serve as surrogates for them. LIDO was designed for museum objects and has begun to supersede CDWA. Dublin Core, of course, is the default standard to consider for any online digital repository. Darwin Core and Audubon Core need no introduction in this community. I have to confess here that I have some personal bias towards VRA Core because I have been involved in the development and maintenance of version 4. But ultimately we determined that VRA Core really was the best fit for the natural history illustrations. Its elements and attributes were most closely aligned with the types of information we wanted users to record. Its relationship of works to one or more images also fit nicely with the book structure, in which a single page often contains one or more illustrations. The only thing VRA Core lacked was a way to record an acceptedName and CommonName for a species. VRA Core has a subject attribute type of scientificName, but taxonomists need more specificity. Darwin Core was able to fulfill this need, so we borrowed 2 elements from that schema.
  • We ended up with 9 elements total, 7 of which came from VRA Core 4.0 and 2 of which came from Darwin Core. The elements in red are required, but since Date, Copyright, and Source are pulled directly from the bibliographic citation for the book, the tagger really only has to enter Title and Type. For Title, we recommend either pulling the value from a caption if one exists or writing a basic description of the objects in the image. For Type, BHL staff will apply at least one of 5 broad types (drawings/paintings, maps, photographs, diagrams, or prints); this gets added during the classification stage that I mentioned.
  • Here is an illustration described using the schema
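
The notes above describe scoring each algorithm against a manually tagged gold standard set and adjusting a coverage percentage to see how the true/false positive and negative counts shift. The following is a minimal sketch of that kind of comparison, not the project's actual code. The `PageResult` structure, the `outcome` and `evaluate` functions, the definition of accuracy as (true positives + true negatives) / total pages, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PageResult:
    """One page as seen by a detector and by the gold standard (hypothetical structure)."""
    page_id: str
    coverage: float         # fraction of the page inside detected bounding boxes (0.0 to 1.0)
    has_illustration: bool  # gold-standard label assigned by a human tagger


def outcome(page: PageResult, threshold: float) -> str:
    """Compare the thresholded prediction with the gold-standard label."""
    predicted = page.coverage >= threshold
    if predicted:
        return "true_positive" if page.has_illustration else "false_positive"
    return "false_negative" if page.has_illustration else "true_negative"


def evaluate(pages: List[PageResult], threshold: float) -> Tuple[float, Dict[str, int]]:
    """Return (accuracy, outcome counts) for one coverage threshold."""
    counts = {
        "true_positive": 0,
        "true_negative": 0,
        "false_positive": 0,
        "false_negative": 0,
    }
    for page in pages:
        counts[outcome(page, threshold)] += 1
    correct = counts["true_positive"] + counts["true_negative"]
    accuracy = correct / len(pages) if pages else 0.0
    return accuracy, counts


# Made-up sample pages, not project data.
sample = [
    PageResult("p1", 0.80, True),   # full plate
    PageResult("p2", 0.05, False),  # text only
    PageResult("p3", 0.12, False),  # e.g. an illustrated letter picked up by the detector
    PageResult("p4", 0.00, True),   # illustration the detector missed
]

for threshold in (0.10, 0.20):
    accuracy, counts = evaluate(sample, threshold)
    print(f"threshold={threshold:.2f} accuracy={accuracy:.2f} {counts}")
```

In this toy data, raising the threshold from 10% to 20% turns the illustrated-letter page from a false positive into a true negative, which is the sort of trade-off the review interface is described as exposing.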

Presentation Transcript

  • Finding a goldmine of natural history illustrations within BHL texts: the Art of Life project TDWG Oct 2013 Florence Italy Trish Rose-Sandler, Missouri Botanical Garden Art of Life project
  • BHL Problem statement – users want access to images, access to images is limited – How to broaden the audiences for BHL content?
  • What is Art of Life? • Full title - The Art of Life: Data Mining and Crowdsourcing the Identification and Description of Natural History Illustrations from the Biodiversity Heritage Library (BHL) • Grant given to Missouri Botanical Garden in St Louis • Funded by National Endowment for the Humanities • Runs May 2012-April 2014
  • 5 Primary Objectives of Art of Life Objective 1: Define an appropriate metadata schema for natural history illustrations Objective 2: Build software tools to automatically identify illustrations in the BHL corpus Objective 3: Enhance existing tools to enable the initial sorting, viewing, and editing of these identified visual resources. Objective 4: Integrate tagging applications to enable a community of users to edit descriptive metadata for the illustrations Objective 5: Integrate the descriptive metadata generated by users back into BHL portal both for access and preservation
  • Current status of Art of Life • Development of the algorithms is complete. Running them across the entire BHL corpus now. • Draft schema for describing natural history illustrations was posted for public review http://tinyurl.com/9hm7nsb. In process of converting to an application profile • Classifier tool complete
  • Algorithms • Developed by folks at Indianapolis Museum of Art (IMA) Lab. • Built 4 primary types: – ABBYY (87% accurate) – Contrast (88% accurate) – Color (.09% accurate) – Compression (9% accurate) • Tested against a gold standard set of 100 books (40k pages) • ABBYY and Contrast were chosen as most effective in finding illustrations
  • Interface designed for BHL to assess performance of algorithms
  • Interface developed to assign broad classes
  • Art of Life Schema Needs to support three objectives: 1) to enable the discovery, description and use of the identified images by artists, biologists, humanities scholars, librarians, and educators 2) to make BHL’s metadata and images available to other platforms 3) to import crowdsourced metadata generated in other platforms back into BHL.
  • Schema landscape review – VRA Core 4.0 (art image community) – LIDO (museum community) – Dublin Core (Web community) – Darwin Core (biodiversity community) – Audubon Core (biodiversity community)
  • ART OF LIFE SCHEMA ELEMENTS (red = required): Title, Type, Date, Copyright, Source, Agent, Subjects, Description, Inscription
  • Example of illustration described using Art of Life schema
    Title: Stictospiza formosa
    Type: Paintings
    Date: Publication: 1898
    Agent: Author: Arthur G. Butler (1844-1925); Illustrator: F.W. Frohawk (1861-1946)
    Description: A pair of finches with green and yellow bodies resting on reeds
    Subjects: Birds, finches; Scientific name: Amandava formosa; Vernacular Name: Green Avadavat or Green Munia; Accepted Name: Amandava formosa (Latham, 1790)
    Inscriptions: bottom center: Green Amaduvade Waxbill (Stictospiza formosa)
    Source: Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited,1889 (2nd edition). This image comes from the Biodiversity Heritage Library, and is available online at biodiversitylibrary.org/page/17195895
    Rights: Public domain
  • How will this project benefit the scientific community? • Will provide access to content in BHL that has been largely hidden and difficult to find. Functionality will be added to the BHL portal to allow searching for images by species name, common names, subjects, and illustrators • Once the images are available and described in places like Flickr and Wikimedia Commons they will become easily linked to and available in other biodiversity-related platforms such as Wikispecies and EOL • Like the text content in BHL, most image content will fall under public domain and be freely available for download and re-use so you can incorporate them into your research and publications
  • Art of Life team • PI: Trish Rose-Sandler, Missouri Botanical Garden • Algorithm development: Ed Bachta, Charlie Moad, Kyle Jaebker, Indianapolis Museum of Art • Schema development: Gaurav Vaidya and Robert Guralnick, University of Colorado, Boulder; William Ulate, Missouri Botanical Garden • Programming: Mike Lichtenberg, Missouri Botanical Garden • Consultants: Doug Holland, Missouri Botanical Garden; Chris Freeland, Washington University (former PI for Art of Life)
  • Interested? Here’s how you can help • We welcome your feedback on the schema before it’s finalized! http://tinyurl.com/9hm7nsb • Would love to talk with other folks about their experiences with crowdsourcing of metadata, particularly if you’ve used Flickr or Wikimedia Commons • Spread the word about this free, rich resource of images http://www.flickr.com/photos/biodivlibrary and help us describe our illustrations!
  • For more info http://biodivlib.wikispaces.com/Art+of+Life Contact trish.rose-sandler@mobot.org
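
The schema slides above list nine elements, five of which (Title, Type, Date, Copyright, Source) are described in the notes as required. As a rough sketch of how a crowdsourced record might be checked before being brought back into BHL, the snippet below flags missing required elements and unrecognized element names. The dictionary representation and the `missing_required` and `unknown_elements` helpers are hypothetical; the project's real schema is a VRA Core 4.0 and Darwin Core profile, not this structure. The example values are taken from the Stictospiza formosa slide, using the element names from the schema slide.

```python
from typing import Dict, List

# Element names as listed on the schema slide.
ELEMENTS = [
    "Title", "Type", "Date", "Copyright", "Source",
    "Agent", "Subjects", "Description", "Inscription",
]
# Required elements according to the speaker notes.
REQUIRED = ["Title", "Type", "Date", "Copyright", "Source"]


def missing_required(record: Dict[str, str]) -> List[str]:
    """Return required elements that are absent or blank, in schema order."""
    return [name for name in REQUIRED if not record.get(name, "").strip()]


def unknown_elements(record: Dict[str, str]) -> List[str]:
    """Return keys in the record that are not one of the nine schema elements."""
    return [name for name in record if name not in ELEMENTS]


# Values copied from the example slide; the dict layout itself is an assumption.
example = {
    "Title": "Stictospiza formosa",
    "Type": "Paintings",
    "Date": "Publication: 1898",
    "Copyright": "Public domain",
    "Source": "Butler, Arthur Gardiner. Foreign finches in captivity. "
              "biodiversitylibrary.org/page/17195895",
    "Agent": "Illustrator: F.W. Frohawk (1861-1946)",
    "Subjects": "Birds, finches",
    "Description": "A pair of finches with green and yellow bodies resting on reeds",
}

print(missing_required(example))
print(unknown_elements(example))
```

Because Date, Copyright, and Source come from the bibliographic citation, a check like this would mostly catch records where a tagger left Title or Type blank.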