Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction

BHL is home to most of the world's biodiversity legacy literature. To let its users find information in a more focused and efficient manner, efforts towards a semantically enabled search engine are currently underway. To this end, semantic metadata in the form of concept annotations has been automatically extracted over the BHL collection using text mining (TM) techniques. This was carried out in four stages: (1) producing a moderately sized BHL corpus in which concepts have been manually marked up and assigned semantic labels, e.g., taxon, location, anatomical entity, habitat; (2) training machine learning-based concept recognition models on that corpus; (3) applying the trained models to BHL documents in order to automatically recognize concepts and assign semantic labels to them; and (4) automatically linking together semantically related concepts using distributional similarity methods.

BHL documents were then indexed according to the semantic annotations generated by this TM methodology. The resulting index makes it possible to add two features to BHL's search engine: (1) query expansion, which helps users widen their search through automatic suggestion of synonyms; and (2) semantic facets, which users can specify to narrow down search results by filtering out documents pertaining to unwanted word senses.
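
The two search features described above can be illustrated with a minimal sketch. It is not the BHL implementation: the document ids, the `DOCS` and `SYNONYMS` tables, and the function names are all hypothetical, and the synonym table is written by hand here, whereas the described system links related concepts automatically.

```python
# Toy collection (hypothetical): each page lists its (concept, semantic type) annotations.
DOCS = {
    "page-1": [("california bay", "SPECIES")],
    "page-2": [("california bay", "LOCATION")],
    "page-3": [("umbellularia californica", "SPECIES")],
}

# Hand-written synonym table (assumption); the described system derives such
# links automatically with distributional similarity methods.
SYNONYMS = {"california bay": ["umbellularia californica"]}


def expand_query(query):
    """Query expansion: return the normalised query followed by its known synonyms."""
    key = query.strip().lower()
    return [key] + SYNONYMS.get(key, [])


def search(query, facet=None, expand=False):
    """Return ids of pages annotated with the query concept.

    facet  -- optional semantic type used to narrow the results
    expand -- if True, also match the synonyms of the query
    """
    terms = expand_query(query) if expand else [query.strip().lower()]
    hits = []
    for doc_id, annotations in DOCS.items():
        if any(concept in terms and (facet is None or sem_type == facet)
               for concept, sem_type in annotations):
            hits.append(doc_id)
    return hits


print(search("California bay"))                                # both word senses
print(search("California bay", facet="SPECIES"))               # tree sense only
print(search("California bay", facet="SPECIES", expand=True))  # tree sense plus synonym
```

The facet narrows the result set to one word sense, while expansion widens it to pages that use a different name for the same concept.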

Published in: Science

  1. Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction. Riza Batista-Navarro, William Ulate, Jennifer Hammock, Georgios Kontonatsios, Trish Rose-Sandler and Sophia Ananiadou
  2. Structured Data ? Text Mining
  3. http://miningbiodiversity.org
  4. The partners: Social Media Lab (slide footer: 10/9/2015, Mining Biodiversity)
  5. Mining Biodiversity
     • Transform BHL into a next-generation social digital library
     • A multi-disciplinary approach
       – Text Mining
       – Machine learning
       – History of Science
       – Environmental History & Studies
       – Library and Information Science
       – Social Media
  6. What do we want to do? Social Media, Visualisation, Semantic Metadata
  7. Biodiversity Heritage Library
     • a consortium of botanical and natural history libraries
     • stores digitised legacy literature on biodiversity
     • currently holds 160,000 volumes = millions of pages (PDFs and OCR-generated text)
     • open-access
  8. Current features
     • supports keyword-based search
     • species names annotated and linked to the Encyclopedia of Life
     • integrates automatic taxonomic name finding tools (uBio Taxonfinder)
     • data access through export functionalities and Web services
  9. Keyword-based search and browsing
  10. Advanced search (also keyword-based)
  11. What's wrong with keyword-based search?
     • Ambiguity!
       – Boxwood: historic place in Alabama? North American term for plants in the Buxaceae family?
       – Box: container? Boxwood for other English-speaking countries?
  12. What's wrong with keyword-based search?
     • Ambiguity!
       – California bay: hardwood tree? location?
       – Drum: musical instrument? fish?
  13. What's wrong with keyword-based search?
     • Ambiguity!
       – Emperor: fish? person?
       – Scrambled eggs: food? plant?
  14. Semantic metadata generation
     • Entity types
       – species
       – location
       – habitat
       – anatomical parts
       – qualities
       – persons
       – temporal expressions
     • Association types
       – observation
       – habitation
       – nutrition
       – trait
  15. Examples of semantic metadata (annotations)
     • Observation
     • Habitation
  16. Examples of semantic metadata (annotations)
     • Nutrition
     • Trait
  17. How does semantic information help?
     • SPECIES: California bay (hardwood tree)
     • LOCATION: California bay (location)
  18. Text mining-based approach (diagram labels: Seed documents, Unlabelled documents, Learn semantics, Annotator/Curator, Validate, Feedback, Annotate, Search index, Store)
  19. Automatic annotation by text mining (TM)
       – Web-based, graphical TM workbench
       – conforms with the Unstructured Information Management Architecture (UIMA) standard
       – facilitates the straightforward integration of various analytics into workflows
       – allows for the validation of annotations
  20. interface
  21. Learning semantics
     • Training of models using machine learning
       – conditional random fields (CRFs) for sequence labelling
       – learning the features of mentions and relations of interest based on labelled documents
         • contextual features: surrounding, co-occurring words
         • dictionary matches: presence of certain words in controlled vocabularies, e.g., Catalogue of Life, Phenotype and Trait Ontology, Gazetteer
  22. interface
  23. Annotation workflow: pre-processing, dictionary lookup, machine learning-based recognition, relation extraction, saving
  24. Validation interface
  25. Enhanced searching of BHL content: faceted search, automatically generated questions, time-sensitive search
  26. Enhanced document viewing: page in PDF/image format; OCR-corrected text with colour-coded annotations
  27. Conclusions
     • Literature is a rich source of information but difficult to search
     • Keyword-based search not enough to address ambiguity
     • Semantic metadata allows for more accurate searching
     • Semantic metadata can be extracted using text mining tools
     • The Argo text mining workbench facilitates the construction of custom semantic metadata generation workflows
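
The annotation workflow on slide 23 (pre-processing, dictionary lookup, machine learning-based recognition, relation extraction, saving) can be pictured as a chain of steps that each enrich a shared document record. The sketch below is a plain-Python illustration of that idea only and does not use Argo or UIMA. Every name in it is hypothetical, the step order is assumed from the order in which the slide lists the steps, the dictionary is a two-entry stand-in for a controlled vocabulary, the machine learning step is a placeholder for a trained CRF model, and the relation rule is a naive co-occurrence heuristic.

```python
import re

# Hypothetical mini-dictionary standing in for a controlled vocabulary.
DICTIONARY = {"drum": "species", "estuaries": "habitat"}


def preprocess(doc):
    """Pre-processing: split the raw text into word tokens."""
    doc["tokens"] = re.findall(r"\w+", doc["text"])
    return doc


def dictionary_lookup(doc):
    """Dictionary lookup: label tokens that appear in the vocabulary."""
    doc["entities"] = [(tok, DICTIONARY[tok.lower()])
                       for tok in doc["tokens"] if tok.lower() in DICTIONARY]
    return doc


def ml_recognition(doc):
    """Placeholder (assumption): a trained model would add or revise entities here."""
    return doc


def relation_extraction(doc):
    """Naive stand-in: pair every species with every habitat in the same text."""
    species = [text for text, label in doc["entities"] if label == "species"]
    habitats = [text for text, label in doc["entities"] if label == "habitat"]
    doc["relations"] = [("habitation", s, h) for s in species for h in habitats]
    return doc


def make_saver(store):
    """Saving: return a step that appends the annotated document to `store`."""
    def save(doc):
        store.append(doc)
        return doc
    return save


def run_workflow(text, steps):
    """Run the steps in order, passing the document record from one to the next."""
    doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc


store = []
steps = [preprocess, dictionary_lookup, ml_recognition,
         relation_extraction, make_saver(store)]
result = run_workflow("The drum inhabits estuaries.", steps)
print(result["entities"])
print(result["relations"])
```

Keeping each step as a function with the same input and output shape is what makes it easy to swap components in and out, which is the kind of flexibility the slides attribute to workflow-based tools.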
