The document summarizes efforts to expand access to biodiversity literature through the Biodiversity Heritage Library (BHL) consortium. Key points:
- BHL is an open-access digital library containing over 180,000 volumes and millions of pages of biodiversity literature.
- The Mining Biodiversity project aims to transform BHL into a next-generation social digital library through text mining, machine learning, and other techniques.
- Semantic metadata is being generated for BHL texts to improve search through disambiguation of terms and expansion of queries using synonyms and related terms. Experiments on recovering known synonyms from the Catalogue of Life report precision at top-20 between 56% and 69%, depending on the taxonomic group.
4. Biodiversity Heritage Library
• a consortium of botanical and natural history libraries
• stores digitised legacy literature on biodiversity
• currently holds 180,000 volumes = millions of pages (PDFs and OCR-generated text)
• open-access
5/31/2016 Mining Biodiversity
9. Mining Biodiversity
• Transform BHL into a next-generation social digital library
• A multi-disciplinary approach
– Text Mining
– Machine learning
– History of Science
– Environmental History & Studies
– Library and Information Science
– Social Media
10. What have we done so far?
Social Media
Visualisation
Semantic Metadata
12. We also partnered with Altmetric to better understand who shares BHL content across various social media platforms, and why
Follow us on Twitter: @SMLabTO
13. “My Tweeps” app
mytweeps.com
Helping BHL (and other organizations) to get daily insights about their Twitter followers (or Tweeps) and what they are interested in.
We call it a "reverse" Twitter because instead of seeing tweets from people whom you follow, the app shows you tweets from people who follow you.
16. What are we doing?
Social Media
Visualisation
Semantic Metadata
17. Current features
• supports keyword-based search
• species names annotated and linked to the Encyclopedia of Life
• integrates automatic taxonomic name finding tools (uBio Taxonfinder / GNRDS) [1]
• data access through export functionalities and Web services
[1] Global Names Recognition and Discovery tools and Services (GNRDS). See http://gnrd.globalnames.org/
21. What’s wrong with keyword-based search: Polysemy
• Ambiguity!
Boxwood
– historic place in Alabama?
– North American term for plants in the Buxaceae family?
Box
– container?
– Boxwood for other English-speaking countries?
26. How does semantic information help?
• Word sense disambiguation
– California bay (hardwood tree): SPECIES
– California bay (location): LOCATION
27. What’s wrong with keyword-based search: Synonymy
Campanula portenschlagiana Schult.
Campanula portenschlagiana Schult.
Campanula affinis Rchb. ex Nyman
Campanula muralis Port. ex A. DC.
28. What’s wrong with keyword-based search: Synonymy
Clematis L.
Clematis L.
Clematopsis Bojer ex Hutch.
Atragene L.
Archiclematis Tamura
29. How does semantic information help?
• Query expansion
Campanula portenschlagiana Schult.
Campanula portenschlagiana Schult.
Campanula affinis Rchb. ex Nyman
Campanula muralis Port. ex A. DC.
30. Term Inventory
• compilation of species names (flowering plants, mammals, birds)
• acts as a thesaurus, as each name is linked to its synonyms as well as other semantically related names
• “semantic relatedness”: defined in terms of a contextual similarity measure, computed over the entire BHL corpus
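The slides do not say which contextual similarity measure is used. As a minimal sketch of the general idea, the Python below assumes a simple count-based approach: each name is represented by the words that occur near it, and two names are compared by the cosine of those count vectors. The function names, the window size, and the toy sentences are illustrative assumptions, not the project's implementation.

```python
from collections import Counter
from math import sqrt

def context_vector(term, sentences, window=2):
    """Count the words found within `window` tokens of each occurrence of `term`.

    Assumption: multi-word names are pre-joined into one token,
    e.g. "campanula_muralis".
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus (invented sentences, for illustration only).
sentences = [
    "the flowers of campanula_muralis grow on old walls".split(),
    "the flowers of campanula_portenschlagiana grow on old walls".split(),
    "the boxwood container was shipped by rail".split(),
]

sim_synonyms = cosine(
    context_vector("campanula_muralis", sentences),
    context_vector("campanula_portenschlagiana", sentences),
)
sim_unrelated = cosine(
    context_vector("campanula_muralis", sentences),
    context_vector("boxwood", sentences),
)
```

Names that appear in the same contexts score close to 1, while names with no shared context score 0. Ranking all candidate names by this score against a query name is one way such a thesaurus of related names could be populated.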
31. Sources we leveraged
• Names
– Encyclopedia of Life (EOL)
– Catalogue of Life
– Global Biodiversity Information Facility (GBIF)
• Images
– Encyclopedia of Life (EOL)
32. Experiments
• Training data: all English texts from the BHL
– about 26 million pages with a size of 49GB
• Evaluation data: synonymous terms from the Catalogue of Life
• Select 500 scientific names and their synonyms from the CoL
• Results at top-20
Category  Class     #terms in CoL  #terms in BHL  Average #synonyms in CoL
Birds     Aves      1140           818            2.28
Mammals   Mammalia  1131           726            2.26
Plants    Plantae   1141           826            2.28

Category  Pre@20  Re@20
Birds     69.41%  63%
Mammals   62.12%  53.84%
Plants    56.17%  21.43%
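Pre@20 and Re@20 are read here as precision and recall over the 20 highest-ranked candidate names. The slides do not give the exact formulas, so the Python sketch below assumes the common definitions: precision is the share of returned names that are gold synonyms, and recall is the share of gold synonyms that were returned. The function name and the candidate list are hypothetical.

```python
def precision_recall_at_k(ranked, gold, k=20):
    """Return (precision, recall) for the top-k entries of a ranked candidate list.

    ranked: candidate names, best first
    gold:   set of known synonyms (e.g. taken from the Catalogue of Life)
    """
    top_k = ranked[:k]
    hits = sum(1 for name in top_k if name in gold)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Illustrative ranking for the query "Campanula portenschlagiana".
# The gold synonyms come from the slides; the other candidates are made up.
ranked = [
    "Campanula muralis",
    "Campanula rotundifolia",
    "Campanula affinis",
    "Campanula latifolia",
]
gold = {"Campanula muralis", "Campanula affinis"}

precision, recall = precision_recall_at_k(ranked, gold, k=3)
```

With `k=3`, two of the three returned names are gold synonyms and both gold synonyms are found, so precision is 2/3 and recall is 1.0. Averaging these values over all query names gives per-category figures of the kind reported in the table.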
33. Application to Query Expansion
• an interface for searching BHL documents using a species name as a query
• query is automatically expanded by retrieving synonyms/semantically related names from the term inventory
• documents mentioning all of the names in the expanded query are returned
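The expansion step described above can be sketched in Python as follows. This is a toy illustration under stated assumptions: the term inventory is modelled as a plain dictionary, documents as an in-memory mapping of IDs to text, and matching as case-insensitive substring search. None of these names reflect the actual BHL interface. The slide states that documents mentioning all names are returned, so `require_all` defaults to `True`; setting it to `False` gives the "any name" behaviour that is also common in query expansion.

```python
# Hypothetical term inventory: accepted name -> synonyms / related names.
TERM_INVENTORY = {
    "Campanula portenschlagiana": ["Campanula affinis", "Campanula muralis"],
}

def expand_query(name, inventory):
    """Return the query name followed by its synonyms from the inventory."""
    return [name] + list(inventory.get(name, []))

def search(name, documents, inventory, require_all=True):
    """Return IDs of documents matching the expanded query.

    require_all=True  -> a document must mention every expanded name
    require_all=False -> a document must mention at least one expanded name
    """
    names = [n.lower() for n in expand_query(name, inventory)]
    match = all if require_all else any
    return [
        doc_id
        for doc_id, text in documents.items()
        if match(n in text.lower() for n in names)
    ]

# Invented documents, for illustration only.
documents = {
    "d1": "Campanula muralis is common on old garden walls.",
    "d2": "Campanula portenschlagiana (syn. Campanula muralis, Campanula affinis).",
    "d3": "Clematis species of the region.",
}

strict = search("Campanula portenschlagiana", documents, TERM_INVENTORY)
loose = search("Campanula portenschlagiana", documents, TERM_INVENTORY,
               require_all=False)
```

Here `strict` contains only `d2`, while `loose` also picks up `d1`, which uses a synonym without ever mentioning the query name.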
38. And more Magnoliopsida species (common) names
CHOICE 1               CHOICE 2              CHOICE 3
Pyrus cydonia          Barbados pride        Prenanthes alba
Clianthus dampieri     White lupin           Yellow pea
Geum intermedium       Pyrus melanocarpa     Erigeron canadensis
Pyrola uniflora        Japanese pagoda tree  Epilobium hirsutum
Ampelopsis engelmanni  Soybean               Salix pentandra
Solanum nodiflorum     Exogonium purga       Lathyrus montanus
Ribes floridum         Impatiens biflora     Stellaria media
Orobus tuberosus       Cassia marilandica    Cnicus discolor
Medicago maculata      Melilotus indica      Apium nodiflorum
Glycine soja           Balsam of tolu        Juglans laciniosa
Stellaria longifolia   Salix arctica         Purging cassia
Echinospermum lappula  Umbrella tree         Potentilla pumila
39. Thank you
William Ulate
Missouri Botanical Garden
william.ulate@mobot.org
Photo: W.Ulate. Corcovado National Park, Costa Rica. 2013
This project was made possible in part by
[LG-00-14-04-0032-14]
Riza Batista-Navarro
NaCTeM, University of Manchester
riza.batista@manchester.ac.uk