The Biodiversity Heritage Library (BHL) is home to most of the world’s biodiversity legacy literature. To help users find information in a more focused and efficient manner, a semantically enabled search engine is under development. To this end, semantic metadata in the form of concept annotations has been automatically extracted from the BHL collection using text mining (TM) techniques. This was carried out in four stages: (1) producing a moderately sized BHL corpus in which concepts have been manually marked up and assigned semantic labels, e.g., taxon, location, anatomical entity, habitat; (2) training machine learning-based concept recognition models on that corpus; (3) applying the trained models to BHL documents in order to automatically recognize and assign semantic labels to concepts; and (4) automatically linking together semantically related concepts using distributional similarity methods. BHL documents were then indexed according to the semantic annotations generated by this TM methodology. This facilitates the incorporation of the following features into BHL’s search engine: (1) query expansion, which helps users widen their search through automatic suggestion of synonyms; and (2) semantic facets, which users can specify to narrow down search results, filtering out documents pertaining to unwanted word senses.
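To make stage (4) and the two search features more concrete, the following is a minimal Python sketch, not the system's actual implementation. It assumes a simple co-occurrence model with cosine similarity as the distributional similarity method. The sentences, the annotation index, the threshold, and all function names (`context_vectors`, `expand_query`, `filter_by_facet`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy, made-up sentences standing in for BHL text (not real data).
SENTENCES = [
    "the puma hunts deer in montane forest",
    "the cougar hunts deer in montane forest",
    "the puma rests in rocky terrain",
    "the cougar rests in rocky terrain",
    "the oak grows in lowland soil",
]

# Hypothetical set of terms that concept recognition labelled as concepts.
CONCEPTS = {"puma", "cougar", "oak", "deer"}

# Hypothetical annotation index: document id -> set of (term, semantic label).
ANNOTATIONS = {
    "doc_a": {("victoria", "taxon")},
    "doc_b": {("victoria", "location")},
}


def context_vectors(sentences, window=2):
    """For each token, count the tokens found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_query(term, vectors, candidates, threshold=0.8):
    """Suggest candidate concepts whose contexts resemble those of `term`."""
    scored = [
        (cosine(vectors[term], vectors[other]), other)
        for other in candidates
        if other != term and other in vectors
    ]
    return [other for score, other in sorted(scored, reverse=True) if score >= threshold]


def filter_by_facet(term, label, annotations):
    """Keep only documents in which `term` carries the requested semantic label."""
    return sorted(doc for doc, tags in annotations.items() if (term, label) in tags)


vectors = context_vectors(SENTENCES)
print(expand_query("puma", vectors, CONCEPTS))
print(filter_by_facet("victoria", "taxon", ANNOTATIONS))
```

In this toy data, "puma" and "cougar" occur in identical contexts, so one is suggested as an expansion of the other, while the facet filter separates documents mentioning Victoria as a taxon from those mentioning it as a location. A production system would presumably use far larger context models and a proper search index.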